Introduction to Coding and Data Management - 2 | Project work - Jacopo Lazzari 894361¶

Summary¶

Introduction

The project focuses on an analysis of the Fortune 1000 companies, in particular the Fortune 1000 list of 2022.

The Fortune 1000 are the 1,000 largest American companies ranked by revenue for a given year, as compiled by Fortune magazine. The list only includes companies that are incorporated or authorized to do business in the United States and for which revenues are publicly available. It spans many different sectors and company sizes, from Big Tech companies such as Apple and Amazon to much smaller firms.

While this list includes numerous prominent and highly successful companies in the country, it is important to note that it is not a flawless measure of success. Only publicly traded companies are considered, which means many privately held businesses are excluded. Additionally, the list relies primarily on revenue generation rather than profitability as a criterion for inclusion.

Research Questions

I will focus on different topics and factors related to Fortune 1000 companies, trying to answer the following research questions.

Financial performances and sectors analysis:

  • Which companies generate the highest revenues? Which sector dominates the Fortune 1000 list in terms of the number of companies and revenues?
  • Are companies profitable? Is there a relationship with revenues? How is the profitability of companies distributed overall?
  • How do different sectors within the Fortune 1000 compare in terms of financial performance metrics, such as profit margins and return on assets (ROA)?
  • Do revenues depend on the number of employees?
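As a preview of how the financial metrics mentioned above could be derived, here is a minimal sketch on a couple of made-up rows. It assumes Revenue, Profit, and Assets columns like those prepared during the data cleaning; the company names and figures are purely illustrative.

```python
import pandas as pd

# Hypothetical sample in the same shape as the cleaned dataset
df = pd.DataFrame({
    "Company": ["A", "B"],
    "Revenue": [1000.0, 500.0],
    "Profit": [100.0, -25.0],
    "Assets": [2000.0, 400.0],
})

# Profit margin: profit as a share of revenue
df["Profit_Margin"] = df["Profit"] / df["Revenue"]
# Return on assets (ROA): profit relative to total assets
df["ROA"] = df["Profit"] / df["Assets"]
print(df[["Company", "Profit_Margin", "ROA"]])
```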

Focus on the geographical dimension:

  • How are headquarters of Fortune 1000 companies distributed across the United States?
  • Which states generate the highest revenues? Is there a correlation between population, state GDP, state unemployment and the number of Fortune 1000 companies?
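The correlation questions above could be answered with a Pearson test once the state-level data is merged. A small sketch follows: the GDP figures are taken from the BEA table used later in the project, while the company counts are made-up numbers for illustration only.

```python
import pandas as pd
from scipy import stats

# Toy state-level figures (company counts are hypothetical)
states = pd.DataFrame({
    "State": ["CA", "TX", "NY", "FL", "OH"],
    "GDP_2021": [3373240.7, 2051768.6, 1901296.5, 1255558.3, 756617.2],
    "Num_Companies": [120, 100, 90, 40, 35],
})

# Pearson correlation between state GDP and number of companies
r, p_value = stats.pearsonr(states["GDP_2021"], states["Num_Companies"])
print(f"r = {r:.3f}, p = {p_value:.3f}")
```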

Company policies:

  • Do companies with founder CEOs or female CEOs exhibit better financial performance compared to other companies?
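This last question lends itself to a simple groupby comparison. The sketch below uses a few made-up rows mirroring the CEO_Founder and Profit columns prepared later in the cleaning; it only illustrates the shape of the comparison, not actual results.

```python
import pandas as pd

# Illustrative frame mirroring the cleaned dataset's columns (hypothetical rows)
df = pd.DataFrame({
    "CEO_Founder": ["yes", "no", "yes", "no"],
    "CEO_Woman":   ["no", "yes", "no", "no"],
    "Profit":      [50.0, 120.0, -10.0, 80.0],
})

# Compare average profit across founder-led and non-founder-led companies
print(df.groupby("CEO_Founder")["Profit"].mean())
```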

Datasets choice

In order to answer the different research questions, I have selected various datasets from different sources.

Fortune 2022 Dataset is the main dataset for this project. I downloaded it from Kaggle (https://www.kaggle.com/datasets/winston56/fortune-500-data-2021). The source seems reliable because the dataset was created by web scraping the official Fortune website.

Fortune 2022 second Dataset is a secondary dataset with some additional columns referring to the Fortune 1000 list that can be merged into the main dataset. The source is again Kaggle (https://www.kaggle.com/datasets/ramjasmaurya/fortune-1000-companieslatest), and this dataset was also created by scraping the Fortune website.

US GDP by State Dataset contains the GDP (Gross Domestic Product) of each state of the United States in 2021. The source is the BEA (Bureau of Economic Analysis), an official United States government source (https://shorturl.at/bzHNX).

US States Abbreviations Dataset includes abbreviations for each state of the United States. Source (Kaggle): https://www.kaggle.com/datasets/giodev11/usstates-dataset?select=state-abbrevs.csv.

US Unemployment Rates by State Dataset contains unemployment rates by state from 2017 to 2022. Source (Kaggle): https://www.kaggle.com/datasets/pasicebear/us-unemployment-rates-per-state-20172021?select=unemployment_rates.csv.

US Population by State Dataset contains US population data by state from 2020 to 2022. The source is the United States Census Bureau: https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html.

Libraries¶

First of all, I import all the libraries needed for the activities I will carry out to complete the project.

In [1]:
%pip install numpy pandas matplotlib plotly seaborn scipy geopandas openpyxl folium branca

import os
import os.path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import copy as cp
import scipy.stats as stats
import geopandas as gpd
import folium
from branca.colormap import LinearColormap
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (1.24.2)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (1.5.3)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (3.7.1)
Requirement already satisfied: plotly in /usr/local/lib/python3.10/dist-packages (5.14.0)
Requirement already satisfied: seaborn in /usr/local/lib/python3.10/dist-packages (0.12.2)
Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (1.10.1)
Requirement already satisfied: geopandas in /usr/local/lib/python3.10/dist-packages (0.12.2)
Collecting openpyxl
  Downloading openpyxl-3.1.2-py2.py3-none-any.whl (249 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 250.0/250.0 KB 5.0 MB/s eta 0:00:00
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2023.3)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2.8.2)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.0.7)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (9.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (3.0.9)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.4.4)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (4.39.3)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (23.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (0.11.0)
Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from plotly) (8.2.2)
Requirement already satisfied: shapely>=1.7 in /usr/local/lib/python3.10/dist-packages (from geopandas) (2.0.1)
Requirement already satisfied: fiona>=1.8 in /usr/local/lib/python3.10/dist-packages (from geopandas) (1.9.2)
Requirement already satisfied: pyproj>=2.6.1.post1 in /usr/local/lib/python3.10/dist-packages (from geopandas) (3.5.0)
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Requirement already satisfied: munch>=2.3.2 in /usr/local/lib/python3.10/dist-packages (from fiona>=1.8->geopandas) (2.5.0)
Requirement already satisfied: click~=8.0 in /usr/local/lib/python3.10/dist-packages (from fiona>=1.8->geopandas) (8.1.3)
Requirement already satisfied: cligj>=0.5 in /usr/local/lib/python3.10/dist-packages (from fiona>=1.8->geopandas) (0.7.2)
Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from fiona>=1.8->geopandas) (2022.12.7)
Requirement already satisfied: attrs>=19.2.0 in /usr/local/lib/python3.10/dist-packages (from fiona>=1.8->geopandas) (22.2.0)
Requirement already satisfied: click-plugins>=1.0 in /usr/local/lib/python3.10/dist-packages (from fiona>=1.8->geopandas) (1.1.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.1.2
Note: you may need to restart the kernel to use updated packages.

Datasets reading and cleaning¶

Fortune 1000 2022 (Main Dataset)¶

I read the main dataset that I will use for the project. It contains the list of Fortune 1000 companies along with many of their features. Revenues, profits, and market cap are in millions of U.S. dollars. The .info() method shows information about the dataset's columns, data types, and missing values.

In [2]:
current_directory = os.getcwd()
csv_file_path = 'Fortune_1000.csv'
absolute_file_path = os.path.join(current_directory, csv_file_path)

Fortune_1000 = pd.read_csv(absolute_file_path, index_col='rank', encoding='utf-8')

Fortune_1000.info()
Fortune_1000
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 1 to 1000
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   company            1000 non-null   object 
 1   rank_change        1000 non-null   float64
 2   revenue            1000 non-null   float64
 3   profit             997 non-null    float64
 4   num. of employees  999 non-null    float64
 5   sector             1000 non-null   object 
 6   city               1000 non-null   object 
 7   state              1000 non-null   object 
 8   newcomer           1000 non-null   object 
 9   ceo_founder        1000 non-null   object 
 10  ceo_woman          1000 non-null   object 
 11  profitable         1000 non-null   object 
 12  prev_rank          1000 non-null   object 
 13  CEO                1000 non-null   object 
 14  Website            1000 non-null   object 
 15  Ticker             951 non-null    object 
 16  Market Cap         969 non-null    object 
dtypes: float64(4), object(13)
memory usage: 140.6+ KB
Out[2]:
company rank_change revenue profit num. of employees sector city state newcomer ceo_founder ceo_woman profitable prev_rank CEO Website Ticker Market Cap
rank
1 Walmart 0.0 572754.0 13673.0 2300000.0 Retailing Bentonville AR no no no yes 1.0 C. Douglas McMillon https://www.stock.walmart.com WMT 352037
2 Amazon 0.0 469822.0 33364.0 1608000.0 Retailing Seattle WA no no no yes 2.0 Andrew R. Jassy www.amazon.com AMZN 1202717
3 Apple 0.0 365817.0 94680.0 154000.0 Technology Cupertino CA no no no yes 3.0 Timothy D. Cook www.apple.com AAPL 2443962
4 CVS Health 0.0 292111.0 7910.0 258000.0 Health Care Woonsocket RI no no yes yes 4.0 Karen Lynch https://www.cvshealth.com CVS 125204
5 UnitedHealth Group 0.0 287597.0 17285.0 350000.0 Health Care Minnetonka MN no no no yes 5.0 Andrew P. Witty www.unitedhealthgroup.com UNH 500468
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
996 Vizio Holding 0.0 2124.0 -39.4 800.0 Industrials Irvine CA no yes no no William W. Wang https://www.vizio.com VZIO 1705.1
997 1-800-Flowers.com 0.0 2122.2 118.7 4800.0 Retailing Jericho NY no no no yes Christopher G. McCann https://www.1800flowers.com FLWS 830
998 Cowen 0.0 2112.8 295.6 1534.0 Financials New York NY no no no yes Jeffrey Solomon https://www.cowen.com COWN 1078
999 Ashland 0.0 2111.0 220.0 4100.0 Chemicals Wilmington DE no no no yes Guillermo Novo https://www.ashland.com ASH 5601.9
1000 DocuSign 0.0 2107.2 -70.0 7461.0 Technology San Francisco CA no no no no Allan C. Thygesen https://www.docusign.com DOCU 21302.8

1000 rows × 17 columns

As can be seen, there are some missing values. So, I search for NaN (Not a Number) values that could interfere with the data analysis. I use the isna() and sum() methods to summarize the number of NaN values in each column.

In [3]:
Fortune_1000.isna().sum()
Out[3]:
company               0
rank_change           0
revenue               0
profit                3
num. of employees     1
sector                0
city                  0
state                 0
newcomer              0
ceo_founder           0
ceo_woman             0
profitable            0
prev_rank             0
CEO                   0
Website               0
Ticker               49
Market Cap           31
dtype: int64

I identify and print the rows in the Fortune_1000 data frame that have missing values.

In [4]:
Fortune_1000.loc[Fortune_1000.isna().any(axis=1), Fortune_1000.isna().any()]
Out[4]:
profit num. of employees Ticker Market Cap
rank
42 1280.9 53586.0 NaN NaN
72 277.1 14344.0 NaN -
76 4412.2 232000.0 NaN -
78 3068.0 45000.0 NaN NaN
80 1617.2 24134.0 NaN NaN
90 4060.7 15065.0 NaN -
95 554.0 9941.0 NaN NaN
96 3300.0 37335.0 NaN -
97 977.8 7585.0 NaN -
100 319.3 10052.0 NaN -
232 287.4 9000.0 NaN NaN
247 288.8 9150.0 NaN NaN
251 734.5 13590.0 NaN NaN
264 -366.8 11787.0 NaN NaN
266 1102.0 3861.0 NaN NaN
303 1605.0 50000.0 NaN -
313 753.0 28800.0 NaN NaN
324 579.0 5739.0 NaN NaN
339 1318.4 6441.0 NaN NaN
351 2530.2 3256.0 NaN NaN
368 1130.1 5805.0 NaN NaN
372 361.2 2538.0 NaN NaN
399 262.4 8800.0 NaN -
418 2842.4 5560.0 NaN NaN
464 311.5 6941.0 NaN NaN
499 NaN 15640.0 FYBR 6763.1
535 NaN 1300.0 CHK 11135.9
539 555.6 25150.0 NaN 27492.2
559 665.9 21700.0 NaN 3088.7
597 -221.4 7500.0 NaN 30976.8
602 622.3 4100.0 NaN NaN
623 455.1 17662.0 NaN 4185.4
647 522.9 3194.0 NaN NaN
650 739.9 4064.0 NaN NaN
659 -165.1 8900.0 NaN NaN
660 699.3 4545.0 NaN 5084.1
680 -27.0 6100.0 NaN 1856.2
690 419.2 3149.0 NaN NaN
710 199.0 9600.0 NaN 2519.4
714 260.8 3185.0 NaN NaN
736 NaN 7600.0 GTX 463.9
748 78.3 2834.0 NaN NaN
775 392.1 1157.0 NaN NaN
789 213.8 1196.0 NaN 3056.2
815 -112.2 15000.0 NaN 6799.3
866 133.7 NaN NaN NaN
874 -104.2 2952.0 NaN 10463.5
878 50.7 1146.0 NaN NaN
880 117.1 3510.0 NaN NaN
907 320.1 2702.0 NaN NaN
925 130.3 749.0 NaN NaN
948 -85.2 1082.0 NaN NaN

Since I won't use the Ticker values, which are just abbreviations used to identify publicly traded companies, and there are 49 NaN values in that column, I remove it from the dataframe. I do the same with the Market Cap column because it has many NaN values and I won't use it in the analysis. I also remove the rank_change and prev_rank columns because they relate to the 2021 list, so some values (such as the 2021 rank of a company that was not on that list) might be misleading.

In [5]:
Fortune_1000 = Fortune_1000.drop(columns=["Ticker", "Market Cap", "rank_change", "prev_rank"])
Fortune_1000
Out[5]:
company revenue profit num. of employees sector city state newcomer ceo_founder ceo_woman profitable CEO Website
rank
1 Walmart 572754.0 13673.0 2300000.0 Retailing Bentonville AR no no no yes C. Douglas McMillon https://www.stock.walmart.com
2 Amazon 469822.0 33364.0 1608000.0 Retailing Seattle WA no no no yes Andrew R. Jassy www.amazon.com
3 Apple 365817.0 94680.0 154000.0 Technology Cupertino CA no no no yes Timothy D. Cook www.apple.com
4 CVS Health 292111.0 7910.0 258000.0 Health Care Woonsocket RI no no yes yes Karen Lynch https://www.cvshealth.com
5 UnitedHealth Group 287597.0 17285.0 350000.0 Health Care Minnetonka MN no no no yes Andrew P. Witty www.unitedhealthgroup.com
... ... ... ... ... ... ... ... ... ... ... ... ... ...
996 Vizio Holding 2124.0 -39.4 800.0 Industrials Irvine CA no yes no no William W. Wang https://www.vizio.com
997 1-800-Flowers.com 2122.2 118.7 4800.0 Retailing Jericho NY no no no yes Christopher G. McCann https://www.1800flowers.com
998 Cowen 2112.8 295.6 1534.0 Financials New York NY no no no yes Jeffrey Solomon https://www.cowen.com
999 Ashland 2111.0 220.0 4100.0 Chemicals Wilmington DE no no no yes Guillermo Novo https://www.ashland.com
1000 DocuSign 2107.2 -70.0 7461.0 Technology San Francisco CA no no no no Allan C. Thygesen https://www.docusign.com

1000 rows × 13 columns

I search for NaN values in the profit and num. of employees columns in order to fill them manually, since there are just 4 values and they can easily be found online.

In [6]:
Fortune_1000[Fortune_1000["profit"].isna()]
Out[6]:
company revenue profit num. of employees sector city state newcomer ceo_founder ceo_woman profitable CEO Website
rank
499 Frontier Communications 6411.0 NaN 15640.0 Telecommunications Norwalk CT no no no no Nick Jeffery https://www.frontier.com
535 Chesapeake Energy 5809.0 NaN 1300.0 Energy Oklahoma City OK no no no no Domenic J. Dell'Osso Jr. https://www.chk.com
736 Garrett Motion 3633.0 NaN 7600.0 Motor Vehicles & Parts Plymouth MI no no no no Olivier Rabiller https://www.garrettmotion.com
In [7]:
Fortune_1000[Fortune_1000["num. of employees"].isna()]
Out[7]:
company revenue profit num. of employees sector city state newcomer ceo_founder ceo_woman profitable CEO Website
rank
866 Univision Communciations 2841.0 133.7 NaN Media New York NY no no no yes Wade C. Davis https://https://corporate.televisaunivision.com
In [8]:
nan_indices = Fortune_1000.index[Fortune_1000["profit"].isna()].tolist()
fill_values = [441.0, 945.0, 398.0]
Fortune_1000.loc[nan_indices, "profit"] = fill_values

Fortune_1000["num. of employees"].fillna(4500.0, inplace=True)

Fortune_1000["num. of employees"] = Fortune_1000["num. of employees"].astype(int)

Fortune_1000.isna().sum()
Out[8]:
company              0
revenue              0
profit               0
num. of employees    0
sector               0
city                 0
state                0
newcomer             0
ceo_founder          0
ceo_woman            0
profitable           0
CEO                  0
Website              0
dtype: int64

Now the dataset has no NaN values...

In [9]:
Fortune_1000.dtypes
Out[9]:
company               object
revenue              float64
profit               float64
num. of employees      int64
sector                object
city                  object
state                 object
newcomer              object
ceo_founder           object
ceo_woman             object
profitable            object
CEO                   object
Website               object
dtype: object

...and the data types are the desired ones. The dataset is now more suitable for the scope of the analysis, but I want to rename the columns to make everything clearer and more readable.

In [10]:
Fortune_1000 = Fortune_1000.rename(columns={
    "company": "Company", "revenue": "Revenue", "profit": "Profit",
    "num. of employees": "Employees", "sector": "Sector", "city": "City",
    "state": "State", "newcomer": "New_Entry", "ceo_founder": "CEO_Founder",
    "ceo_woman": "CEO_Woman", "profitable": "Profitable"
})
Fortune_1000
Out[10]:
Company Revenue Profit Employees Sector City State New_Entry CEO_Founder CEO_Woman Profitable CEO Website
rank
1 Walmart 572754.0 13673.0 2300000 Retailing Bentonville AR no no no yes C. Douglas McMillon https://www.stock.walmart.com
2 Amazon 469822.0 33364.0 1608000 Retailing Seattle WA no no no yes Andrew R. Jassy www.amazon.com
3 Apple 365817.0 94680.0 154000 Technology Cupertino CA no no no yes Timothy D. Cook www.apple.com
4 CVS Health 292111.0 7910.0 258000 Health Care Woonsocket RI no no yes yes Karen Lynch https://www.cvshealth.com
5 UnitedHealth Group 287597.0 17285.0 350000 Health Care Minnetonka MN no no no yes Andrew P. Witty www.unitedhealthgroup.com
... ... ... ... ... ... ... ... ... ... ... ... ... ...
996 Vizio Holding 2124.0 -39.4 800 Industrials Irvine CA no yes no no William W. Wang https://www.vizio.com
997 1-800-Flowers.com 2122.2 118.7 4800 Retailing Jericho NY no no no yes Christopher G. McCann https://www.1800flowers.com
998 Cowen 2112.8 295.6 1534 Financials New York NY no no no yes Jeffrey Solomon https://www.cowen.com
999 Ashland 2111.0 220.0 4100 Chemicals Wilmington DE no no no yes Guillermo Novo https://www.ashland.com
1000 DocuSign 2107.2 -70.0 7461 Technology San Francisco CA no no no no Allan C. Thygesen https://www.docusign.com

1000 rows × 13 columns

Fortune 1000 2022 second Dataset¶

I read another dataset containing the Fortune 1000 2022 companies. I will use it just to import the assets column, which is useful for my analysis. As in the previous dataset, revenues, profits, assets, and market values are in millions of U.S. dollars.

In [11]:
csv_file_path = 'fortune 1000 companies in 2022 - 2022.csv'
absolute_file_path = os.path.join(current_directory, csv_file_path)

Fortune_1000_sec = pd.read_csv(absolute_file_path, index_col = 'rank in 2022', encoding='utf-8')

Fortune_1000_sec.info()
Fortune_1000_sec
<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 1 to 1,000
Data columns (total 10 columns):
 #   Column                                Non-Null Count  Dtype 
---  ------                                --------------  ----- 
 0   Name                                  1000 non-null   object
 1   Revenue                               1000 non-null   object
 2   revenue(% change)                     1000 non-null   object
 3   profits in millions                   1000 non-null   object
 4   profits % change                      1000 non-null   object
 5   assets                                1000 non-null   object
 6   market value                          1000 non-null   object
 7   change in rank of top 1000 companies  1000 non-null   object
 8   employees                             1000 non-null   object
 9   change in rank(500 only)              1000 non-null   object
dtypes: object(10)
memory usage: 85.9+ KB
Out[11]:
Name Revenue revenue(% change) profits in millions profits % change\r assets market value change in rank of top 1000 companies employees change in rank(500 only)\r\n
rank in 2022
1 Walmart $572,754 2.40% $13,673 1.20% $244,860 $409,795 - 2,300,000 -
2 Amazon $469,822 21.70% $33,364 56.40% $420,549 $1,658,807.30 - 1,608,000 -
3 Apple $365,817 33.30% $94,680 64.90% $351,002 $2,849,537.60 - 154,000 -
4 CVS Health $292,111 8.70% $7,910 10.20% $232,999 $132,839.20 - 258,000 -
5 UnitedHealth Group $287,597 11.80% $17,285 12.20% $212,206 $479,830.30 - 350,000 -
... ... ... ... ... ... ... ... ... ... ...
996 Vizio Holding $2,124 4% -$39.40 -138.40% $935.80 $1,705.10 - 800 -
997 1-800-Flowers.com $2,122.20 42.50% $118.70 101.10% $1,076.70 $830 - 4,800 -
998 Cowen $2,112.80 30.20% $295.60 36.60% $8,748.80 $744.10 - 1,534 -
999 Ashland Global Holdings $2,111 -11.20% $220 - $6,612 $5,601.90 -130 4,100 -
1,000 DocuSign $2,107.20 45% -$70 - $2,541.30 $21,302.80 - 7,461 -

1000 rows × 10 columns

I search for NaN values and, apparently, there are none.

In [12]:
Fortune_1000_sec.isna().sum()
Out[12]:
Name                                    0
Revenue                                 0
revenue(% change)                       0
profits in millions                     0
profits % change\r                      0
assets                                  0
market value                            0
change in rank of top 1000 companies    0
employees                               0
change in rank(500 only)\r\n            0
dtype: int64

However, missing values in this dataset are not represented as NaN but as "-". If I count the cells containing "-" in the data frame, I can see that there are actually many missing values.

In [13]:
count_missing_values = (Fortune_1000_sec.applymap(lambda x: '-' in str(x))).sum().sum()
print(count_missing_values)
2193
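Note that the substring check above also matches legitimate negative values (such as -$39.40), so it overstates the number of truly missing cells. A stricter count would match only cells that are exactly "-". Here is a sketch on a small toy frame (applying it to the real data would give a different, lower total):

```python
import pandas as pd

toy = pd.DataFrame({
    "a": ["-", "-$39.40", "1,234"],
    "b": ["-130", "-", "5"],
})

# Substring test: counts any cell containing "-", including negative numbers
loose = toy.applymap(lambda x: "-" in str(x)).sum().sum()
# Exact test: counts only cells that are just a dash (true placeholders)
strict = toy.applymap(lambda x: str(x).strip() == "-").sum().sum()
print(loose, strict)
```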

That is quite a lot of missing values. Since I'm only interested in the assets column, I search for missing values in that column specifically.

In [14]:
count_assets = Fortune_1000_sec["assets"].str.count("-").sum()
print(count_assets)
0

There are no missing values in this column. I check the data type of the column assets.

In [15]:
assets_dtype = Fortune_1000_sec["assets"].dtypes
print(assets_dtype)
object

The data type is object, so I want to convert it to float. However, some symbols ("$" and ",") prevent a direct conversion, so I remove them first and then convert the values to float.

In [16]:
Fortune_1000_sec["assets"] = Fortune_1000_sec["assets"].str.replace("$", "", regex=False)
Fortune_1000_sec["assets"] = Fortune_1000_sec["assets"].str.replace(",", "", regex=False)
Fortune_1000_sec["assets"] = Fortune_1000_sec["assets"].astype(float)
print(Fortune_1000_sec["assets"])
rank in 2022
1        244860.0
2        420549.0
3        351002.0
4        232999.0
5        212206.0
           ...   
996         935.8
997        1076.7
998        8748.8
999        6612.0
1,000      2541.3
Name: assets, Length: 1000, dtype: float64

US GDP by State Dataset¶

I read a dataset containing the 2021 GDP of each state of the United States in millions of U.S. dollars.

In [17]:
csv_file_path = 'US GDP by State 2021.csv'
absolute_file_path = os.path.join(current_directory, csv_file_path)

US_GDP_State = pd.read_csv(absolute_file_path,encoding='utf-8',skiprows=3,skipfooter=13,engine='python')

US_GDP_State.info()
US_GDP_State
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   GeoFips  52 non-null     int64  
 1   GeoName  52 non-null     object 
 2   2021     52 non-null     float64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.3+ KB
Out[17]:
GeoFips GeoName 2021
0 0 United States * 23315081.0
1 1000 Alabama 254109.7
2 2000 Alaska 57349.4
3 4000 Arizona 420026.7
4 5000 Arkansas 148676.1
5 6000 California 3373240.7
6 8000 Colorado 436359.5
7 9000 Connecticut 298395.2
8 10000 Delaware 81160.0
9 11000 District of Columbia 153670.5
10 12000 Florida 1255558.3
11 13000 Georgia 691626.9
12 15000 Hawaii 91096.1
13 16000 Idaho 96282.8
14 17000 Illinois 945673.8
15 18000 Indiana 412975.2
16 19000 Iowa 216860.2
17 20000 Kansas 191380.6
18 21000 Kentucky 237182.0
19 22000 Louisiana 258571.3
20 23000 Maine 77963.3
21 24000 Maryland 443929.9
22 25000 Massachusetts 641332.2
23 26000 Michigan 572205.8
24 27000 Minnesota 412458.6
25 28000 Mississippi 127307.7
26 29000 Missouri 358572.0
27 30000 Montana 58699.8
28 31000 Nebraska 146285.4
29 32000 Nevada 194486.6
30 33000 New Hampshire 99673.3
31 34000 New Jersey 682945.9
32 35000 New Mexico 109582.8
33 36000 New York 1901296.5
34 37000 North Carolina 662120.8
35 38000 North Dakota 63559.6
36 39000 Ohio 756617.2
37 40000 Oklahoma 215336.3
38 41000 Oregon 272190.9
39 42000 Pennsylvania 844496.5
40 44000 Rhode Island 66570.9
41 45000 South Carolina 269802.5
42 46000 South Dakota 61684.7
43 47000 Tennessee 427125.5
44 48000 Texas 2051768.6
45 49000 Utah 225340.3
46 50000 Vermont 37103.8
47 51000 Virginia 604957.6
48 53000 Washington 677489.5
49 54000 West Virginia 85434.2
50 55000 Wisconsin 368611.3
51 56000 Wyoming 41510.2
In [18]:
US_GDP_State.isna().sum()
Out[18]:
GeoFips    0
GeoName    0
2021       0
dtype: int64

There are no missing values. Now I remove the first row, which refers to the United States as a whole, and the GeoFips column, which is not useful.

In [19]:
US_GDP_State = US_GDP_State.drop(US_GDP_State.loc[US_GDP_State["GeoName"]=="United States *"].index)
US_GDP_State = US_GDP_State.drop("GeoFips", axis=1)
US_GDP_State.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 1 to 51
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   GeoName  51 non-null     object 
 1   2021     51 non-null     float64
dtypes: float64(1), object(1)
memory usage: 948.0+ bytes

The dataframe contains 51 rows (50 states + the District of Columbia). In order to decide whether or not to keep the District of Columbia in the list of states, I check whether there are companies from this district in the main dataset (Fortune_1000).

In [20]:
(Fortune_1000['State'] == 'DC').sum()
Out[20]:
5

There are in fact 5 such companies, so I decide to keep DC in the list of US states. I rename the dataframe columns to make them clearer, and I also reset the index so that it starts from 0.

In [21]:
US_GDP_State = US_GDP_State.rename(columns={'GeoName': 'State', '2021': 'GDP_2021'})
US_GDP_State.reset_index(drop=True, inplace=True)
US_GDP_State
Out[21]:
State GDP_2021
0 Alabama 254109.7
1 Alaska 57349.4
2 Arizona 420026.7
3 Arkansas 148676.1
4 California 3373240.7
5 Colorado 436359.5
6 Connecticut 298395.2
7 Delaware 81160.0
8 District of Columbia 153670.5
9 Florida 1255558.3
10 Georgia 691626.9
11 Hawaii 91096.1
12 Idaho 96282.8
13 Illinois 945673.8
14 Indiana 412975.2
15 Iowa 216860.2
16 Kansas 191380.6
17 Kentucky 237182.0
18 Louisiana 258571.3
19 Maine 77963.3
20 Maryland 443929.9
21 Massachusetts 641332.2
22 Michigan 572205.8
23 Minnesota 412458.6
24 Mississippi 127307.7
25 Missouri 358572.0
26 Montana 58699.8
27 Nebraska 146285.4
28 Nevada 194486.6
29 New Hampshire 99673.3
30 New Jersey 682945.9
31 New Mexico 109582.8
32 New York 1901296.5
33 North Carolina 662120.8
34 North Dakota 63559.6
35 Ohio 756617.2
36 Oklahoma 215336.3
37 Oregon 272190.9
38 Pennsylvania 844496.5
39 Rhode Island 66570.9
40 South Carolina 269802.5
41 South Dakota 61684.7
42 Tennessee 427125.5
43 Texas 2051768.6
44 Utah 225340.3
45 Vermont 37103.8
46 Virginia 604957.6
47 Washington 677489.5
48 West Virginia 85434.2
49 Wisconsin 368611.3
50 Wyoming 41510.2

US States Abbreviations Dataset¶

The following dataset contains abbreviations for each state of the United States.

In [22]:
csv_file_path = 'US States abbrevs.csv'
absolute_file_path = os.path.join(current_directory, csv_file_path)

US_States = pd.read_csv(absolute_file_path,encoding='utf-8')

US_States.info()
US_States
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   state         51 non-null     object
 1   abbreviation  51 non-null     object
dtypes: object(2)
memory usage: 944.0+ bytes
Out[22]:
state abbreviation
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
5 Colorado CO
6 Connecticut CT
7 Delaware DE
8 District of Columbia DC
9 Florida FL
10 Georgia GA
11 Hawaii HI
12 Idaho ID
13 Illinois IL
14 Indiana IN
15 Iowa IA
16 Kansas KS
17 Kentucky KY
18 Louisiana LA
19 Maine ME
20 Montana MT
21 Nebraska NE
22 Nevada NV
23 New Hampshire NH
24 New Jersey NJ
25 New Mexico NM
26 New York NY
27 North Carolina NC
28 North Dakota ND
29 Ohio OH
30 Oklahoma OK
31 Oregon OR
32 Maryland MD
33 Massachusetts MA
34 Michigan MI
35 Minnesota MN
36 Mississippi MS
37 Missouri MO
38 Pennsylvania PA
39 Rhode Island RI
40 South Carolina SC
41 South Dakota SD
42 Tennessee TN
43 Texas TX
44 Utah UT
45 Vermont VT
46 Virginia VA
47 Washington WA
48 West Virginia WV
49 Wisconsin WI
50 Wyoming WY

As always, I check for missing values and rename the columns.

In [23]:
US_States.isna().sum()
Out[23]:
state           0
abbreviation    0
dtype: int64
In [24]:
US_States = US_States.rename(columns={'state': 'State', 'abbreviation': 'Abbreviation'})
US_States
Out[24]:
State Abbreviation
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
5 Colorado CO
6 Connecticut CT
7 Delaware DE
8 District of Columbia DC
9 Florida FL
10 Georgia GA
11 Hawaii HI
12 Idaho ID
13 Illinois IL
14 Indiana IN
15 Iowa IA
16 Kansas KS
17 Kentucky KY
18 Louisiana LA
19 Maine ME
20 Montana MT
21 Nebraska NE
22 Nevada NV
23 New Hampshire NH
24 New Jersey NJ
25 New Mexico NM
26 New York NY
27 North Carolina NC
28 North Dakota ND
29 Ohio OH
30 Oklahoma OK
31 Oregon OR
32 Maryland MD
33 Massachusetts MA
34 Michigan MI
35 Minnesota MN
36 Mississippi MS
37 Missouri MO
38 Pennsylvania PA
39 Rhode Island RI
40 South Carolina SC
41 South Dakota SD
42 Tennessee TN
43 Texas TX
44 Utah UT
45 Vermont VT
46 Virginia VA
47 Washington WA
48 West Virginia WV
49 Wisconsin WI
50 Wyoming WY

US Unemployment Rates by State Dataset¶

I read the dataset containing unemployment rates by state from 2017 to 2022.

In [25]:
csv_file_path = 'unemployment_rates.csv'
absolute_file_path = os.path.join(current_directory, csv_file_path)

US_Unemployment = pd.read_csv(absolute_file_path,encoding='utf-8')

US_Unemployment.info()
US_Unemployment
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3621 entries, 0 to 3620
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   first_day_of_month  3621 non-null   object 
 1   state               3621 non-null   object 
 2   unemployment_rate   3621 non-null   float64
dtypes: float64(1), object(2)
memory usage: 85.0+ KB
Out[25]:
first_day_of_month state unemployment_rate
0 2017-01-01 California 5.2
1 2017-02-01 California 5.1
2 2017-03-01 California 5.0
3 2017-04-01 California 5.0
4 2017-05-01 California 4.9
... ... ... ...
3616 2022-07-01 District of Columbia 5.2
3617 2022-08-01 District of Columbia 5.1
3618 2022-09-01 District of Columbia 4.7
3619 2022-10-01 District of Columbia 4.8
3620 2022-11-01 District of Columbia 4.6

3621 rows × 3 columns

In [26]:
US_Unemployment.isna().sum()
Out[26]:
first_day_of_month    0
state                 0
unemployment_rate     0
dtype: int64

There are no NaN values, but the dataset covers years outside the scope of my research, so I have to remove all rows for 2017-2020 and 2022. I am only interested in 2021 because the Fortune 1000 2022 list refers to revenues earned in 2021.

I convert the column first_day_of_month to datetime, then create a boolean mask that selects only rows from 2021 and apply it to the dataframe.

In [27]:
US_Unemployment['first_day_of_month'] = pd.to_datetime(US_Unemployment['first_day_of_month'])
mask = US_Unemployment['first_day_of_month'].dt.year == 2021
US_Unemployment = US_Unemployment[mask].copy()  # .copy() avoids a SettingWithCopyWarning when adding columns later
US_Unemployment.info()
US_Unemployment
<class 'pandas.core.frame.DataFrame'>
Int64Index: 612 entries, 48 to 3609
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   first_day_of_month  612 non-null    datetime64[ns]
 1   state               612 non-null    object        
 2   unemployment_rate   612 non-null    float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 19.1+ KB
Out[27]:
first_day_of_month state unemployment_rate
48 2021-01-01 California 8.8
49 2021-02-01 California 8.6
50 2021-03-01 California 8.4
51 2021-04-01 California 8.3
52 2021-05-01 California 7.9
... ... ... ...
3605 2021-08-01 District of Columbia 6.9
3606 2021-09-01 District of Columbia 6.5
3607 2021-10-01 District of Columbia 6.3
3608 2021-11-01 District of Columbia 6.1
3609 2021-12-01 District of Columbia 6.1

612 rows × 3 columns

Since I have monthly unemployment rates, I want to convert them into a single annual rate per state. I extract the year from the column first_day_of_month and group the data by state and year, taking the mean unemployment rate, which I think is the best way to summarize the monthly data available. Since I only have 2021 data, I can drop the year column. I also rename the other columns.

In [28]:
US_Unemployment['year'] = US_Unemployment['first_day_of_month'].dt.year
US_Unemployment = US_Unemployment.groupby(['state', 'year'])['unemployment_rate'].mean().reset_index()

US_Unemployment.drop(columns=['year'], inplace=True)
US_Unemployment.rename(columns={'state': 'State', 'unemployment_rate': 'Unemployment'}, inplace=True)

US_Unemployment
Out[28]:
State Unemployment
0 Alabama 3.450000
1 Alaska 6.425000
2 Arizona 4.950000
3 Arkansas 4.058333
4 California 7.366667
5 Colorado 5.408333
6 Connecticut 6.300000
7 Delaware 5.391667
8 District of Columbia 6.675000
9 Florida 4.600000
10 Georgia 3.941667
11 Hawaii 5.775000
12 Idaho 3.591667
13 Illinois 6.125000
14 Indiana 3.616667
15 Iowa 4.275000
16 Kansas 3.283333
17 Kentucky 4.708333
18 Louisiana 5.475000
19 Maine 4.633333
20 Maryland 5.783333
21 Massachusetts 5.758333
22 Michigan 5.933333
23 Minnesota 3.425000
24 Mississippi 5.625000
25 Missouri 4.375000
26 Montana 3.400000
27 Nebraska 2.516667
28 Nevada 7.250000
29 New Hampshire 3.533333
30 New Jersey 6.391667
31 New Mexico 6.808333
32 New York 6.958333
33 North Carolina 4.841667
34 North Dakota 3.683333
35 Ohio 5.166667
36 Oklahoma 3.891667
37 Oregon 5.241667
38 Pennsylvania 6.416667
39 Rhode Island 5.675000
40 South Carolina 4.016667
41 South Dakota 3.133333
42 Tennessee 4.316667
43 Texas 5.708333
44 Utah 2.716667
45 Vermont 3.450000
46 Virginia 3.925000
47 Washington 5.283333
48 West Virginia 5.075000
49 Wisconsin 3.816667
50 Wyoming 4.516667

US Population by State Dataset¶

I read and clean the last dataset. It contains data about US population by state from 2020 to 2022.

In [29]:
excel_file_path = 'NST-EST2022-POP.xlsx'
absolute_file_path = os.path.join(current_directory, excel_file_path)

US_Population = pd.read_excel(absolute_file_path,skiprows=3,skipfooter=7)

US_Population.info()
US_Population
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56 entries, 0 to 55
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  56 non-null     object
 1   Unnamed: 1  56 non-null     int64 
 2   2020        56 non-null     int64 
 3   2021        56 non-null     int64 
 4   2022        56 non-null     int64 
dtypes: int64(4), object(1)
memory usage: 2.3+ KB
Out[29]:
Unnamed: 0 Unnamed: 1 2020 2021 2022
0 United States 331449520 331511512 332031554 333287557
1 Northeast 57609156 57448898 57259257 57040406
2 Midwest 68985537 68961043 68836505 68787595
3 South 126266262 126450613 127346029 128716192
4 West 78588565 78650958 78589763 78743364
5 .Alabama 5024356 5031362 5049846 5074296
6 .Alaska 733378 732923 734182 733583
7 .Arizona 7151507 7179943 7264877 7359197
8 .Arkansas 3011555 3014195 3028122 3045637
9 .California 39538245 39501653 39142991 39029342
10 .Colorado 5773733 5784865 5811297 5839926
11 .Connecticut 3605942 3597362 3623355 3626205
12 .Delaware 989957 992114 1004807 1018396
13 .District of Columbia 689546 670868 668791 671803
14 .Florida 21538226 21589602 21828069 22244823
15 .Georgia 10711937 10729828 10788029 10912876
16 .Hawaii 1455273 1451043 1447154 1440196
17 .Idaho 1839092 1849202 1904314 1939033
18 .Illinois 12812545 12786580 12686469 12582032
19 .Indiana 6785668 6788799 6813532 6833037
20 .Iowa 3190372 3190571 3197689 3200517
21 .Kansas 2937847 2937919 2937922 2937150
22 .Kentucky 4505893 4507445 4506589 4512310
23 .Louisiana 4657749 4651664 4627098 4590241
24 .Maine 1362341 1363557 1377238 1385340
25 .Maryland 6177213 6173205 6174610 6164660
26 .Massachusetts 7029949 6995729 6989690 6981974
27 .Michigan 10077325 10069577 10037504 10034113
28 .Minnesota 5706504 5709852 5711471 5717184
29 .Mississippi 2961288 2958141 2949586 2940057
30 .Missouri 6154920 6153998 6169823 6177957
31 .Montana 1084197 1087075 1106227 1122867
32 .Nebraska 1961489 1962642 1963554 1967923
33 .Nevada 3104624 3115648 3146402 3177772
34 .New Hampshire 1377518 1378587 1387505 1395231
35 .New Jersey 9289031 9271689 9267961 9261699
36 .New Mexico 2117527 2118390 2116677 2113344
37 .New York 20201230 20108296 19857492 19677151
38 .North Carolina 10439414 10449445 10565885 10698973
39 .North Dakota 779091 779518 777934 779261
40 .Ohio 11799374 11797517 11764342 11756058
41 .Oklahoma 3959346 3964912 3991225 4019800
42 .Oregon 4237291 4244795 4256301 4240137
43 .Pennsylvania 13002689 12994440 13012059 12972008
44 .Rhode Island 1097371 1096345 1096985 1093734
45 .South Carolina 5118429 5131848 5193266 5282634
46 .South Dakota 886677 887799 896164 909824
47 .Tennessee 6910786 6925619 6968351 7051339
48 .Texas 29145428 29232474 29558864 30029572
49 .Utah 3271614 3283785 3339113 3380800
50 .Vermont 643085 642893 646972 647064
51 .Virginia 8631384 8636471 8657365 8683619
52 .Washington 7705247 7724031 7740745 7785786
53 .West Virginia 1793755 1791420 1785526 1775156
54 .Wisconsin 5893725 5896271 5880101 5892539
55 .Wyoming 576837 577605 579483 581381
In [30]:
US_Population.isna().sum()
Out[30]:
Unnamed: 0    0
Unnamed: 1    0
2020          0
2021          0
2022          0
dtype: int64

There are no missing values.

The unnamed columns are, respectively, "Geographic Area" and "April 1, 2020 Estimates Base". They are unnamed because I skipped some header rows of the Excel file to get the dataframe into a better format.

Since I only want data about individual states and not about aggregate geographic areas (the US as a whole, the Northeast, the West, etc.), I rename the first column to "State" and then drop the first 5 rows, which are not useful.

I will then drop the "April 1, 2020 Estimates Base", 2020 and 2022 columns, since I only need the 2021 population figures.

In [31]:
US_Population.drop(columns=['Unnamed: 1'], inplace=True)                   # drop "April 1, 2020 Estimates Base"
US_Population = US_Population.drop(US_Population.columns[[1, 3]], axis=1)  # drop the 2020 and 2022 columns
US_Population = US_Population.drop(range(5), axis=0)                       # drop US total and region aggregates
US_Population.rename(columns={"Unnamed: 0": "State"}, inplace=True)
US_Population.head()
Out[31]:
State 2021
5 .Alabama 5049846
6 .Alaska 734182
7 .Arizona 7264877
8 .Arkansas 3028122
9 .California 39142991

Now I want to clean the values in the State column, because each state name is prefixed with a "." that would interfere with later merges. Then I reset the dataframe index so that it starts from 0 again instead of 5.

In [32]:
US_Population['State'] = US_Population['State'].str.replace(".", "", regex=False)  # regex=False treats "." literally and avoids a FutureWarning
US_Population.reset_index(drop=True, inplace=True)
US_Population
Out[32]:
State 2021
0 Alabama 5049846
1 Alaska 734182
2 Arizona 7264877
3 Arkansas 3028122
4 California 39142991
5 Colorado 5811297
6 Connecticut 3623355
7 Delaware 1004807
8 District of Columbia 668791
9 Florida 21828069
10 Georgia 10788029
11 Hawaii 1447154
12 Idaho 1904314
13 Illinois 12686469
14 Indiana 6813532
15 Iowa 3197689
16 Kansas 2937922
17 Kentucky 4506589
18 Louisiana 4627098
19 Maine 1377238
20 Maryland 6174610
21 Massachusetts 6989690
22 Michigan 10037504
23 Minnesota 5711471
24 Mississippi 2949586
25 Missouri 6169823
26 Montana 1106227
27 Nebraska 1963554
28 Nevada 3146402
29 New Hampshire 1387505
30 New Jersey 9267961
31 New Mexico 2116677
32 New York 19857492
33 North Carolina 10565885
34 North Dakota 777934
35 Ohio 11764342
36 Oklahoma 3991225
37 Oregon 4256301
38 Pennsylvania 13012059
39 Rhode Island 1096985
40 South Carolina 5193266
41 South Dakota 896164
42 Tennessee 6968351
43 Texas 29558864
44 Utah 3339113
45 Vermont 646972
46 Virginia 8657365
47 Washington 7740745
48 West Virginia 1785526
49 Wisconsin 5880101
50 Wyoming 579483

Datasets combining and reshaping¶

US Dataset¶

In order to have a better visualization of the US data, I merge the three dataframes into a single one.

In [33]:
US_merged = pd.merge(US_States, US_GDP_State, on='State')
US_merged = pd.merge(US_merged, US_Unemployment, on='State')
US_merged = pd.merge(US_merged, US_Population, on='State')
US_merged.info()
US_merged
<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   State         51 non-null     object 
 1   Abbreviation  51 non-null     object 
 2   GDP_2021      51 non-null     float64
 3   Unemployment  51 non-null     float64
 4   2021          51 non-null     int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 2.4+ KB
Out[33]:
State Abbreviation GDP_2021 Unemployment 2021
0 Alabama AL 254109.7 3.450000 5049846
1 Alaska AK 57349.4 6.425000 734182
2 Arizona AZ 420026.7 4.950000 7264877
3 Arkansas AR 148676.1 4.058333 3028122
4 California CA 3373240.7 7.366667 39142991
5 Colorado CO 436359.5 5.408333 5811297
6 Connecticut CT 298395.2 6.300000 3623355
7 Delaware DE 81160.0 5.391667 1004807
8 District of Columbia DC 153670.5 6.675000 668791
9 Florida FL 1255558.3 4.600000 21828069
10 Georgia GA 691626.9 3.941667 10788029
11 Hawaii HI 91096.1 5.775000 1447154
12 Idaho ID 96282.8 3.591667 1904314
13 Illinois IL 945673.8 6.125000 12686469
14 Indiana IN 412975.2 3.616667 6813532
15 Iowa IA 216860.2 4.275000 3197689
16 Kansas KS 191380.6 3.283333 2937922
17 Kentucky KY 237182.0 4.708333 4506589
18 Louisiana LA 258571.3 5.475000 4627098
19 Maine ME 77963.3 4.633333 1377238
20 Montana MT 58699.8 3.400000 1106227
21 Nebraska NE 146285.4 2.516667 1963554
22 Nevada NV 194486.6 7.250000 3146402
23 New Hampshire NH 99673.3 3.533333 1387505
24 New Jersey NJ 682945.9 6.391667 9267961
25 New Mexico NM 109582.8 6.808333 2116677
26 New York NY 1901296.5 6.958333 19857492
27 North Carolina NC 662120.8 4.841667 10565885
28 North Dakota ND 63559.6 3.683333 777934
29 Ohio OH 756617.2 5.166667 11764342
30 Oklahoma OK 215336.3 3.891667 3991225
31 Oregon OR 272190.9 5.241667 4256301
32 Maryland MD 443929.9 5.783333 6174610
33 Massachusetts MA 641332.2 5.758333 6989690
34 Michigan MI 572205.8 5.933333 10037504
35 Minnesota MN 412458.6 3.425000 5711471
36 Mississippi MS 127307.7 5.625000 2949586
37 Missouri MO 358572.0 4.375000 6169823
38 Pennsylvania PA 844496.5 6.416667 13012059
39 Rhode Island RI 66570.9 5.675000 1096985
40 South Carolina SC 269802.5 4.016667 5193266
41 South Dakota SD 61684.7 3.133333 896164
42 Tennessee TN 427125.5 4.316667 6968351
43 Texas TX 2051768.6 5.708333 29558864
44 Utah UT 225340.3 2.716667 3339113
45 Vermont VT 37103.8 3.450000 646972
46 Virginia VA 604957.6 3.925000 8657365
47 Washington WA 677489.5 5.283333 7740745
48 West Virginia WV 85434.2 5.075000 1785526
49 Wisconsin WI 368611.3 3.816667 5880101
50 Wyoming WY 41510.2 4.516667 579483
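An inner merge (pandas' default) silently drops rows whose State spellings don't match exactly, which is why cleaning the leading periods mattered. The validate and indicator options of pd.merge can make such mismatches explicit; a minimal sketch on toy dataframes (hypothetical miniature versions of the real ones):

```python
import pandas as pd

states = pd.DataFrame({'State': ['Ohio', 'Utah'], 'Abbreviation': ['OH', 'UT']})
pop = pd.DataFrame({'State': ['Ohio', '.Utah'], 'Population': [11_764_342, 3_339_113]})

# validate='one_to_one' raises if either side has duplicate keys;
# indicator=True adds a _merge column showing where each row came from.
merged = states.merge(pop, on='State', how='outer', validate='one_to_one', indicator=True)
print(merged[merged['_merge'] != 'both'])  # rows that failed to match, e.g. '.Utah'
```

Here the uncleaned '.Utah' shows up as an unmatched row instead of silently disappearing.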

And I rename some columns.

In [34]:
US_merged = US_merged.rename(columns={'GDP_2021': 'GDP($)', 'Unemployment': 'Unemployment(%)'})
US_merged = US_merged.rename(columns={US_merged.columns[4]: 'Population'})  # rename the 2021 population column
US_merged
Out[34]:
State Abbreviation GDP($) Unemployment(%) Population
0 Alabama AL 254109.7 3.450000 5049846
1 Alaska AK 57349.4 6.425000 734182
2 Arizona AZ 420026.7 4.950000 7264877
3 Arkansas AR 148676.1 4.058333 3028122
4 California CA 3373240.7 7.366667 39142991
5 Colorado CO 436359.5 5.408333 5811297
6 Connecticut CT 298395.2 6.300000 3623355
7 Delaware DE 81160.0 5.391667 1004807
8 District of Columbia DC 153670.5 6.675000 668791
9 Florida FL 1255558.3 4.600000 21828069
10 Georgia GA 691626.9 3.941667 10788029
11 Hawaii HI 91096.1 5.775000 1447154
12 Idaho ID 96282.8 3.591667 1904314
13 Illinois IL 945673.8 6.125000 12686469
14 Indiana IN 412975.2 3.616667 6813532
15 Iowa IA 216860.2 4.275000 3197689
16 Kansas KS 191380.6 3.283333 2937922
17 Kentucky KY 237182.0 4.708333 4506589
18 Louisiana LA 258571.3 5.475000 4627098
19 Maine ME 77963.3 4.633333 1377238
20 Montana MT 58699.8 3.400000 1106227
21 Nebraska NE 146285.4 2.516667 1963554
22 Nevada NV 194486.6 7.250000 3146402
23 New Hampshire NH 99673.3 3.533333 1387505
24 New Jersey NJ 682945.9 6.391667 9267961
25 New Mexico NM 109582.8 6.808333 2116677
26 New York NY 1901296.5 6.958333 19857492
27 North Carolina NC 662120.8 4.841667 10565885
28 North Dakota ND 63559.6 3.683333 777934
29 Ohio OH 756617.2 5.166667 11764342
30 Oklahoma OK 215336.3 3.891667 3991225
31 Oregon OR 272190.9 5.241667 4256301
32 Maryland MD 443929.9 5.783333 6174610
33 Massachusetts MA 641332.2 5.758333 6989690
34 Michigan MI 572205.8 5.933333 10037504
35 Minnesota MN 412458.6 3.425000 5711471
36 Mississippi MS 127307.7 5.625000 2949586
37 Missouri MO 358572.0 4.375000 6169823
38 Pennsylvania PA 844496.5 6.416667 13012059
39 Rhode Island RI 66570.9 5.675000 1096985
40 South Carolina SC 269802.5 4.016667 5193266
41 South Dakota SD 61684.7 3.133333 896164
42 Tennessee TN 427125.5 4.316667 6968351
43 Texas TX 2051768.6 5.708333 29558864
44 Utah UT 225340.3 2.716667 3339113
45 Vermont VT 37103.8 3.450000 646972
46 Virginia VA 604957.6 3.925000 8657365
47 Washington WA 677489.5 5.283333 7740745
48 West Virginia WV 85434.2 5.075000 1785526
49 Wisconsin WI 368611.3 3.816667 5880101
50 Wyoming WY 41510.2 4.516667 579483

Fortune 1000 datasets joining¶

As I said before, I want to add the assets column from the secondary Fortune 1000 dataframe so that I can carry out a more complete analysis of Fortune 1000 companies. I reset both indexes and assign the column by position, which assumes the two dataframes list the companies in the same order.

In [35]:
Fortune_1000.reset_index(drop=True, inplace=True)
Fortune_1000_sec.reset_index(drop=True, inplace=True)
Fortune_1000["Assets"] = Fortune_1000_sec["assets"]
Fortune_1000.info()
Fortune_1000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Company      1000 non-null   object 
 1   Revenue      1000 non-null   float64
 2   Profit       1000 non-null   float64
 3   Employees    1000 non-null   int64  
 4   Sector       1000 non-null   object 
 5   City         1000 non-null   object 
 6   State        1000 non-null   object 
 7   New_Entry    1000 non-null   object 
 8   CEO_Founder  1000 non-null   object 
 9   CEO_Woman    1000 non-null   object 
 10  Profitable   1000 non-null   object 
 11  CEO          1000 non-null   object 
 12  Website      1000 non-null   object 
 13  Assets       1000 non-null   float64
dtypes: float64(3), int64(1), object(10)
memory usage: 109.5+ KB
Out[35]:
Company Revenue Profit Employees Sector City State New_Entry CEO_Founder CEO_Woman Profitable CEO Website Assets
0 Walmart 572754.0 13673.0 2300000 Retailing Bentonville AR no no no yes C. Douglas McMillon https://www.stock.walmart.com 244860.0
1 Amazon 469822.0 33364.0 1608000 Retailing Seattle WA no no no yes Andrew R. Jassy www.amazon.com 420549.0
2 Apple 365817.0 94680.0 154000 Technology Cupertino CA no no no yes Timothy D. Cook www.apple.com 351002.0
3 CVS Health 292111.0 7910.0 258000 Health Care Woonsocket RI no no yes yes Karen Lynch https://www.cvshealth.com 232999.0
4 UnitedHealth Group 287597.0 17285.0 350000 Health Care Minnetonka MN no no no yes Andrew P. Witty www.unitedhealthgroup.com 212206.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 Vizio Holding 2124.0 -39.4 800 Industrials Irvine CA no yes no no William W. Wang https://www.vizio.com 935.8
996 1-800-Flowers.com 2122.2 118.7 4800 Retailing Jericho NY no no no yes Christopher G. McCann https://www.1800flowers.com 1076.7
997 Cowen 2112.8 295.6 1534 Financials New York NY no no no yes Jeffrey Solomon https://www.cowen.com 8748.8
998 Ashland 2111.0 220.0 4100 Chemicals Wilmington DE no no no yes Guillermo Novo https://www.ashland.com 6612.0
999 DocuSign 2107.2 -70.0 7461 Technology San Francisco CA no no no no Allan C. Thygesen https://www.docusign.com 2541.3

1000 rows × 14 columns
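Positional assignment only works because both dataframes follow the same rank order; a merge on the company name would be order-independent. A sketch on toy dataframes (hypothetical values taken from the rows shown above):

```python
import pandas as pd

# Miniature stand-ins for the two Fortune 1000 frames, deliberately in different orders.
main = pd.DataFrame({'Company': ['Walmart', 'Amazon'], 'Revenue': [572754.0, 469822.0]})
secondary = pd.DataFrame({'company': ['Amazon', 'Walmart'], 'assets': [420549.0, 244860.0]})

# Positional assignment would pair Walmart with Amazon's assets here;
# merging on the name pairs each company with its own assets regardless of order.
merged = main.merge(secondary.rename(columns={'company': 'Company', 'assets': 'Assets'}),
                    on='Company', how='left')
print(merged)
```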

Purely for design and legibility, I re-order the columns of my dataframe.

In [36]:
new_column_order = ['Company', 'Revenue', 'Assets','Profit','Employees', 'Sector', 'State','City', 'CEO_Founder', 'CEO_Woman','New_Entry', 'Profitable', 'CEO', 'Website']
Fortune_1000 = Fortune_1000.reindex(columns=new_column_order)
Fortune_1000
Out[36]:
Company Revenue Assets Profit Employees Sector State City CEO_Founder CEO_Woman New_Entry Profitable CEO Website
0 Walmart 572754.0 244860.0 13673.0 2300000 Retailing AR Bentonville no no no yes C. Douglas McMillon https://www.stock.walmart.com
1 Amazon 469822.0 420549.0 33364.0 1608000 Retailing WA Seattle no no no yes Andrew R. Jassy www.amazon.com
2 Apple 365817.0 351002.0 94680.0 154000 Technology CA Cupertino no no no yes Timothy D. Cook www.apple.com
3 CVS Health 292111.0 232999.0 7910.0 258000 Health Care RI Woonsocket no yes no yes Karen Lynch https://www.cvshealth.com
4 UnitedHealth Group 287597.0 212206.0 17285.0 350000 Health Care MN Minnetonka no no no yes Andrew P. Witty www.unitedhealthgroup.com
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 Vizio Holding 2124.0 935.8 -39.4 800 Industrials CA Irvine yes no no no William W. Wang https://www.vizio.com
996 1-800-Flowers.com 2122.2 1076.7 118.7 4800 Retailing NY Jericho no no no yes Christopher G. McCann https://www.1800flowers.com
997 Cowen 2112.8 8748.8 295.6 1534 Financials NY New York no no no yes Jeffrey Solomon https://www.cowen.com
998 Ashland 2111.0 6612.0 220.0 4100 Chemicals DE Wilmington no no no yes Guillermo Novo https://www.ashland.com
999 DocuSign 2107.2 2541.3 -70.0 7461 Technology CA San Francisco no no no no Allan C. Thygesen https://www.docusign.com

1000 rows × 14 columns

Data analysis and visualization¶

Financial performances and sectors analysis¶

To start answering my research questions, I want to understand which companies generate the highest revenues and how revenues are distributed overall. I use interactive bar charts, histograms and boxplots from the plotly library.

In [37]:
top_10_revenues = Fortune_1000.nlargest(10, 'Revenue')

fig = px.bar(top_10_revenues, x='Company', y='Revenue', labels={'Revenue': 'Revenues (millions $)'}, title='Top 10 Companies by Revenues')
fig.update_layout(xaxis_tickangle=-45)
fig.show()
In [38]:
fig = px.histogram(Fortune_1000, x='Revenue', nbins=50, labels={'Revenue': 'Revenues (millions $)'}, title='Distribution of Revenues')
fig.update_layout(xaxis_range=[0, Fortune_1000['Revenue'].max()], xaxis_tickangle=-45)
fig.show()
In [39]:
fig = px.box(Fortune_1000, x='Revenue')
fig.update_layout(title='Distribution of Revenues', xaxis_title='Revenues (millions $)', yaxis_title='Distribution')
fig.show()

As we can see, there are a few outliers in terms of revenues, but most companies fall into revenue ranges below 20 billion dollars.

Now I want to understand which sector dominates the Fortune 1000 list. I use a bar chart to show the distribution of companies by sector, and a scatter plot with revenues on a logarithmic scale, which gives a much more readable graph when there are outliers or large gaps between the highest and lowest values.

In [40]:
sector_summary = Fortune_1000['Sector'].value_counts()

fig = px.bar(sector_summary, x=sector_summary.index, y=sector_summary.values, color=sector_summary.index)
fig.update_layout(xaxis_title='Sector', yaxis_title='Frequency', title='Distribution of Companies by Sector',
                  xaxis_tickangle=-90)
fig.show()
In [41]:
unique_sectors = Fortune_1000['Sector'].unique()

colors = plt.cm.tab20(np.linspace(0, 1, len(unique_sectors)))

plt.figure(figsize=(12, 6))
for sector, color in zip(unique_sectors, colors):
    sector_data = Fortune_1000[Fortune_1000['Sector'] == sector]
    frequencies = np.arange(1, len(sector_data) + 1)
    plt.scatter(frequencies, sector_data['Revenue'], color=color, label=sector)

plt.xlabel('Frequency')
plt.ylabel('Revenues (millions $)')
plt.title('Companies Revenues by Sector')
plt.xticks(rotation=90)
plt.yscale('log') 
plt.legend()
plt.tight_layout()
plt.show()

I use a log scale so that I can better visualize the data. I can do this because revenues are strictly positive (it would not be possible with profits, since losses are included).
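If I did want a log-like view of a variable with negative values such as profits, matplotlib's symlog scale (linear near zero, logarithmic further out) would be one option; a minimal sketch using a few profit figures from the dataframe:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Sample profits (millions $), including two losses.
profits = np.array([-70.0, -39.4, 118.7, 220.0, 13673.0, 94680.0])

fig, ax = plt.subplots()
ax.scatter(range(len(profits)), profits)
ax.set_yscale('symlog', linthresh=100)  # linear within +/-100, logarithmic outside
ax.set_ylabel('Profits (millions $)')
```

A plain log scale would fail here because the negative values have no logarithm.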

It seems that the sectors with the largest number of companies also dominate in terms of revenues, but I want to dig deeper into this.

I compute the total revenues for each sector to get a clearer picture.

In [42]:
revenues_by_sector = Fortune_1000.groupby('Sector')['Revenue'].sum()
sectors = revenues_by_sector.index
total_revenues = revenues_by_sector.values

fig = px.bar(x=sectors, y=total_revenues, color=sectors)
fig.update_layout(
    title='Total Revenues by Sector',
    xaxis=dict(title='Sector'),
    yaxis=dict(title='Total Revenues (millions $)'),
    xaxis_tickangle=-45
)
fig.show()
In [43]:
fig = go.Figure(data=[go.Pie(labels=sectors, values=total_revenues, marker=dict(colors=colors), textinfo='label')])
fig.update_layout(title='Total Revenues by Sector')
fig.show()

Financials, Retailing, Health Care, Technology and Energy are the undisputed leaders of the Fortune 1000 (and of the economy in general, I would add) in terms of total revenues.

Now I want to add profits to the analysis and study the correlation between revenues and profits. Here I cannot use a log scale because profits can be negative (there are some losses), so the graph is less readable.

Using functions from the scipy.stats module, I compute the regression line describing the relationship between revenues and profits. I also report the correlation coefficient (r), a numerical measure of the direction and strength of the linear association, and R^2, which measures how well the regression line fits the data.

In [44]:
revenues = Fortune_1000['Revenue']
profits = Fortune_1000['Profit']

slope, intercept, r_value, p_value, std_err = stats.linregress(revenues, profits)
regression_line = lambda x: slope * x + intercept
correlation_coefficient = np.corrcoef(revenues, profits)[0, 1]
coefficient_of_determination = r_value**2

scatter_trace = go.Scatter(
    x=revenues,
    y=profits,
    mode='markers',
    marker=dict(color='blue', opacity=0.5),
    name='Data'
)

regression_trace = go.Scatter(
    x=revenues,
    y=regression_line(revenues),
    mode='lines',
    line=dict(color='red'),
    name='Regression Line'
)

layout = go.Layout(
    title='Revenues vs Profits',
    xaxis=dict(title='Revenues (millions $)'),
    yaxis=dict(title='Profits (millions $)'),
    showlegend=False,
    annotations=[
        dict(
            x=0.05,
            y=0.95,
            xref='paper',
            yref='paper',
            text=f'y = {slope:.2f}x + {intercept:.2f}',
            showarrow=False,
            font=dict(size=12),
            align='left'
        ),
        dict(
            x=0.05,
            y=0.90,
            xref='paper',
            yref='paper',
            text=f'Correlation (r): {correlation_coefficient:.2f}',
            showarrow=False,
            font=dict(size=12),
            align='left'
        ),
        dict(
            x=0.05,
            y=0.85,
            xref='paper',
            yref='paper',
            text=f'Squared correlation (R^2): {coefficient_of_determination:.2f}',
            showarrow=False,
            font=dict(size=12),
            align='left'
        )
    ]
)

fig = go.Figure(data=[scatter_trace, regression_trace], layout=layout)
fig.show()

There seems to be a moderate positive correlation between revenues and profits (r = 0.65). The linear model fits the data fairly well (R^2 = 0.42).
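For simple linear regression with one predictor, R^2 is exactly the square of Pearson's r (0.65^2 ≈ 0.42), so the two annotations are consistent by construction. A quick check on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)  # noisy linear relationship

# r_value from linregress and Pearson's r from corrcoef are the same quantity.
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
r = np.corrcoef(x, y)[0, 1]
print(abs(r_value**2 - r**2))  # ~0: R^2 is just r squared in the simple case
```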

I also want to analyse the distribution of profits, so I use a bar chart of the top performers, a histogram and a density plot.

In [45]:
top_10_profitable = Fortune_1000.nlargest(10, 'Profit')

bar_trace = go.Bar(
    x=top_10_profitable['Company'],
    y=top_10_profitable['Profit'],
    marker=dict(color='blue')
)

layout = go.Layout(
    title='Top 10 Most Profitable Companies',
    xaxis=dict(title='Company'),
    yaxis=dict(title='Profits (millions $)'),
    showlegend=False,
    xaxis_tickangle=-90
)

fig = go.Figure(data=[bar_trace], layout=layout)

fig.show()
In [46]:
histogram_trace = go.Histogram(
    x=Fortune_1000['Profit'],
    nbinsx=50,
    marker=dict(color='blue', line=dict(color='black', width=1))
)

layout = go.Layout(
    title='Distribution of Profits',
    xaxis=dict(title='Profits (millions $)'),
    yaxis=dict(title='Frequency'),
    showlegend=False
)

fig = go.Figure(data=[histogram_trace], layout=layout)
fig.show()
In [47]:
plt.figure(figsize=(10, 6))
sns.kdeplot(Fortune_1000['Profit'], fill=True)
plt.xlabel('Profits (millions $)')
plt.ylabel('Density')
plt.title('Density Plot of Profits')
plt.show()

I can also compare profitability across sectors to see which are the most profitable. Let's see with some other graphs.

In [48]:
profits_by_sector = Fortune_1000.groupby('Sector')['Profit'].sum()
sectors = profits_by_sector.index
total_profits = profits_by_sector.values

colors = [
    '#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
    '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf',
    '#b0c4de', '#ffb6c1', '#c0c0c0', '#66cdaa', '#ffa500',
    '#ffd700', '#d2691e', '#40e0d0', '#808080', '#008080',
    '#000080', '#800080', '#008000', '#800000', '#0000ff'
]

bar_trace = go.Bar(
    x=sectors,
    y=total_profits,
    marker=dict(color=colors[:len(sectors)])
)

layout = go.Layout(
    title='Total Profits by Sector',
    xaxis=dict(title='Sector'),
    yaxis=dict(title='Profit'),
    xaxis_tickangle=-45
)

fig = go.Figure(data=[bar_trace], layout=layout)
fig.show()
In [49]:
profits_by_sector = Fortune_1000.groupby('Sector')['Profit'].sum()
sectors = profits_by_sector.index
total_profits = profits_by_sector.values

colors = [
    '#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
    '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf',
    '#b0c4de', '#ffb6c1', '#c0c0c0', '#66cdaa', '#ffa500',
    '#ffd700', '#d2691e', '#40e0d0', '#808080', '#008080',
    '#000080', '#800080', '#008000', '#800000', '#0000ff'
]

pie_trace = go.Pie(
    labels=sectors,
    values=total_profits,
    marker=dict(colors=colors[:len(sectors)]),
    textinfo='none'  
)

layout = go.Layout(
    title='Total Profits by Sector'
)

fig = go.Figure(data=[pie_trace], layout=layout)
fig.show()

As expected, the leading sectors by revenues are also the most profitable: Financials, Technology, Health Care, Energy and Retailing.

Now let's take a look at profit margins: Profit Margin = (Net Profit / Revenue) * 100

It is a financial metric that measures a company's profitability. Expressed as a percentage, it represents the portion of profit earned for each dollar of revenue: the higher, the better. A higher profit margin indicates that a company generates more profit relative to its revenues, which is generally seen as a positive sign of financial health and efficiency.

I add profit margins column to my dataframe Fortune_1000.

In [50]:
Fortune_1000['Profit_Margin(%)'] = (Fortune_1000['Profit'] / Fortune_1000['Revenue']) * 100
Fortune_1000
Out[50]:
Company Revenue Assets Profit Employees Sector State City CEO_Founder CEO_Woman New_Entry Profitable CEO Website Profit_Margin(%)
0 Walmart 572754.0 244860.0 13673.0 2300000 Retailing AR Bentonville no no no yes C. Douglas McMillon https://www.stock.walmart.com 2.387238
1 Amazon 469822.0 420549.0 33364.0 1608000 Retailing WA Seattle no no no yes Andrew R. Jassy www.amazon.com 7.101413
2 Apple 365817.0 351002.0 94680.0 154000 Technology CA Cupertino no no no yes Timothy D. Cook www.apple.com 25.881793
3 CVS Health 292111.0 232999.0 7910.0 258000 Health Care RI Woonsocket no yes no yes Karen Lynch https://www.cvshealth.com 2.707875
4 UnitedHealth Group 287597.0 212206.0 17285.0 350000 Health Care MN Minnetonka no no no yes Andrew P. Witty www.unitedhealthgroup.com 6.010146
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 Vizio Holding 2124.0 935.8 -39.4 800 Industrials CA Irvine yes no no no William W. Wang https://www.vizio.com -1.854991
996 1-800-Flowers.com 2122.2 1076.7 118.7 4800 Retailing NY Jericho no no no yes Christopher G. McCann https://www.1800flowers.com 5.593252
997 Cowen 2112.8 8748.8 295.6 1534 Financials NY New York no no no yes Jeffrey Solomon https://www.cowen.com 13.990913
998 Ashland 2111.0 6612.0 220.0 4100 Chemicals DE Wilmington no no no yes Guillermo Novo https://www.ashland.com 10.421601
999 DocuSign 2107.2 2541.3 -70.0 7461 Technology CA San Francisco no no no no Allan C. Thygesen https://www.docusign.com -3.321944

1000 rows × 15 columns

In [51]:
profit_margins_by_sector = Fortune_1000.groupby('Sector')['Profit_Margin(%)'].mean()
sectors = profit_margins_by_sector.index
margin_values = profit_margins_by_sector.values

colors = [
    '#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
    '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf',
    '#b0c4de', '#ffb6c1', '#c0c0c0', '#66cdaa', '#ffa500',
    '#ffd700', '#d2691e', '#40e0d0', '#808080', '#008080',
    '#000080', '#800080', '#008000', '#800000', '#0000ff'
]

bar_trace = go.Bar(
    x=sectors,
    y=margin_values,
    marker=dict(color=colors[:len(sectors)])
)

layout = go.Layout(
    xaxis=dict(title='Sector'),
    yaxis=dict(title='Profit Margin (%)'),
    title='Profit Margins by Sector',
    xaxis_tickangle=-45
)

fig = go.Figure(data=[bar_trace], layout=layout)
fig.show()

Analysing profit margins by sector, we can see that Retailing, Health Care, Technology, and Energy (the unquestioned leaders in total revenues and profits) perform roughly in line with the average of the other sectors. The Financials sector, however, is the clear best performer, with a mean profit margin about 5 percentage points higher than every other sector.
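The ~5-point lead can be checked directly from the grouped means. A minimal sketch with made-up sector margins (the real values come from `profit_margins_by_sector` above):

```python
import pandas as pd

# Toy sector margins (illustrative values, not the real dataset)
margins = pd.Series({'Financials': 15.0, 'Technology': 10.0,
                     'Retailing': 4.0, 'Energy': 8.0})

# Financials' lead = its mean margin minus the average of all other sectors
lead = margins['Financials'] - margins.drop('Financials').mean()
print(f'Financials lead over the other sectors: {lead:.1f} percentage points')
```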

Another interesting financial ratio is ROA (Return On Assets), which evaluates a company's efficiency in generating profits from its assets: ROA = Net Income / Average Total Assets.

Since profit can generally be considered equivalent to net income, I will perform the analysis using data from the dataframe Fortune_1000 (using period-end assets, as only one balance-sheet snapshot per company is available).
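A toy sketch (hypothetical numbers) of the difference between the textbook average-assets definition and the period-end variant used here:

```python
import pandas as pd

# One hypothetical company with two balance-sheet snapshots (toy numbers)
df = pd.DataFrame({'Profit': [100.0],
                   'Assets_begin': [900.0],
                   'Assets_end': [1100.0]})

# ROA on period-end assets (the only option with a single snapshot, as below)
df['ROA_end(%)'] = df['Profit'] / df['Assets_end'] * 100
# ROA on average assets (the textbook definition)
df['ROA_avg(%)'] = df['Profit'] / ((df['Assets_begin'] + df['Assets_end']) / 2) * 100
print(df[['ROA_end(%)', 'ROA_avg(%)']].round(2))
```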

In [52]:
Fortune_1000['ROA(%)'] = (Fortune_1000['Profit'] / Fortune_1000['Assets']) * 100
Fortune_1000
Out[52]:
Company Revenue Assets Profit Employees Sector State City CEO_Founder CEO_Woman New_Entry Profitable CEO Website Profit_Margin(%) ROA(%)
0 Walmart 572754.0 244860.0 13673.0 2300000 Retailing AR Bentonville no no no yes C. Douglas McMillon https://www.stock.walmart.com 2.387238 5.584007
1 Amazon 469822.0 420549.0 33364.0 1608000 Retailing WA Seattle no no no yes Andrew R. Jassy www.amazon.com 7.101413 7.933439
2 Apple 365817.0 351002.0 94680.0 154000 Technology CA Cupertino no no no yes Timothy D. Cook www.apple.com 25.881793 26.974205
3 CVS Health 292111.0 232999.0 7910.0 258000 Health Care RI Woonsocket no yes no yes Karen Lynch https://www.cvshealth.com 2.707875 3.394864
4 UnitedHealth Group 287597.0 212206.0 17285.0 350000 Health Care MN Minnetonka no no no yes Andrew P. Witty www.unitedhealthgroup.com 6.010146 8.145387
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 Vizio Holding 2124.0 935.8 -39.4 800 Industrials CA Irvine yes no no no William W. Wang https://www.vizio.com -1.854991 -4.210301
996 1-800-Flowers.com 2122.2 1076.7 118.7 4800 Retailing NY Jericho no no no yes Christopher G. McCann https://www.1800flowers.com 5.593252 11.024426
997 Cowen 2112.8 8748.8 295.6 1534 Financials NY New York no no no yes Jeffrey Solomon https://www.cowen.com 13.990913 3.378749
998 Ashland 2111.0 6612.0 220.0 4100 Chemicals DE Wilmington no no no yes Guillermo Novo https://www.ashland.com 10.421601 3.327284
999 DocuSign 2107.2 2541.3 -70.0 7461 Technology CA San Francisco no no no no Allan C. Thygesen https://www.docusign.com -3.321944 -2.754496

1000 rows × 16 columns

Before analysing ROA, I want to take a look at the total assets data, starting with the top 10 companies by total assets.

In [53]:
top_10_assets = Fortune_1000.nlargest(10, 'Assets')

bar_trace = go.Bar(
    x=top_10_assets['Company'],
    y=top_10_assets['Assets']
)

layout = go.Layout(
    xaxis=dict(title='Company'),
    yaxis=dict(title='Total Assets'),
    title='Top 10 Companies by Total Assets',
    xaxis_tickangle=-45
)

fig = go.Figure(data=[bar_trace], layout=layout)
fig.show()

As can be seen, all top 10 companies by total assets come from the Financials sector, with total assets in the trillions of dollars. All of them provide financial services: banks, investment banks, insurance companies, and holdings. Several of them are also United States government-sponsored enterprises. Now, as done with revenues and profits, I want to look at total assets by sector.

In [54]:
assets_by_sector = Fortune_1000.groupby('Sector')['Assets'].sum().reset_index()

fig = px.bar(assets_by_sector, x='Sector', y='Assets', color='Sector', title='Total Assets by Sector',
             labels={'Assets': 'Total Assets'}, color_discrete_sequence=colors)
fig.update_layout(xaxis_tickangle=-45)
fig.show()
In [55]:
assets_by_sector = Fortune_1000.groupby('Sector')['Assets'].sum().reset_index()
sectors = assets_by_sector['Sector']
assets_values = assets_by_sector['Assets']

fig = go.Figure(data=[go.Pie(labels=sectors, values=assets_values, marker=dict(colors=colors), textinfo='none')])
fig.update_layout(title='Total Assets by Sector')
fig.show()

Financials dominates in terms of total assets due to the nature of their businesses: around 66% of total assets of Fortune 1000 companies are owned by companies from the Financials sector.
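The ~66% share can be computed directly from the grouped sums; a minimal sketch with toy numbers standing in for the real `Sector`/`Assets` columns:

```python
import pandas as pd

# Toy data standing in for Fortune_1000[['Sector', 'Assets']]
toy = pd.DataFrame({'Sector': ['Financials', 'Financials', 'Technology', 'Retailing'],
                    'Assets': [8000.0, 5000.0, 1500.0, 500.0]})

# Each sector's share of total assets, in percent
share = toy.groupby('Sector')['Assets'].sum() / toy['Assets'].sum() * 100
print(share.sort_values(ascending=False).round(1))
```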

In [56]:
roa_by_sector = Fortune_1000.groupby('Sector')['ROA(%)'].mean()
sectors = roa_by_sector.index
roa_values = roa_by_sector.values

colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
          '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf',
          '#b0c4de', '#ffb6c1', '#c0c0c0', '#66cdaa', '#ffa500',
          '#ffd700', '#d2691e', '#40e0d0', '#808080', '#008080',
          '#000080', '#800080', '#008000', '#800000', '#0000ff']

bar_trace = go.Bar(
    x=sectors,
    y=roa_values,
    marker=dict(color=colors[:len(sectors)])
)

layout = go.Layout(
    xaxis=dict(title='Sector'),
    yaxis=dict(title='Return on Assets (ROA)'),
    title='ROA by Sector',
    xaxis_tickangle=-45
)

fig = go.Figure(data=[bar_trace], layout=layout)
fig.show()

However, considering ROA by sector, things change: even though they hold by far the most assets, companies from the Financials sector have a lower ROA than almost every other sector. The Apparel sector (design, manufacturing, and retailing of clothing, footwear, and accessories) dominates in terms of efficient use of assets.

In [57]:
correlation_matrix = Fortune_1000[['Assets', 'Revenue', 'Profit']].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix: Assets, Revenues, and Profits')
plt.tight_layout()
plt.show()

The correlation matrix shows a positive correlation among all three metrics: revenues, assets, and profits. The strongest relationship is the one between revenues and profits (0.65), while the correlation between revenues and assets is only weak (0.3).
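Since revenues, assets, and profits are heavily right-skewed, the Pearson coefficients above can be distorted by a few giant companies; a rank-based (Spearman) correlation is a quick robustness check. A sketch on toy data:

```python
import pandas as pd

# Toy data mimicking heavy-tailed financials (illustrative values)
toy = pd.DataFrame({'Revenue': [100.0, 50.0, 20.0, 10.0, 5.0],
                    'Profit': [10.0, 6.0, 1.0, 0.5, -0.2],
                    'Assets': [80.0, 300.0, 15.0, 8.0, 4.0]})

pearson = toy.corr(method='pearson')    # sensitive to extreme values
spearman = toy.corr(method='spearman')  # based on ranks, more robust
print(spearman.round(2))
```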

Regarding the number of employees, I want to see which companies have the highest headcount and the highest revenue per employee.

In [58]:
top_10_employees = Fortune_1000.nlargest(10, 'Employees')

fig = px.bar(top_10_employees, x='Company', y='Employees', color='Company')
fig.update_layout(
    xaxis_title='Company',
    yaxis_title='Number of Employees',
    title='Top 10 Companies by Number of Employees',
    xaxis_tickangle=-45
)

fig.show()

Walmart and Amazon have the highest number of employees, but in terms of revenue per employee, other companies dominate with impressive numbers.

In [59]:
Fortune_1000['Revenue_per_Employee'] = Fortune_1000['Revenue'] / Fortune_1000['Employees']
top_10_revenue_per_employee = Fortune_1000.nlargest(10, 'Revenue_per_Employee')

fig = px.bar(top_10_revenue_per_employee, x='Company', y='Revenue_per_Employee', color='Company')
fig.update_layout(
    xaxis_title='Company',
    yaxis_title='Revenue per Employee (millions $)',
    title='Top 10 Companies by Revenue per Employee',
    xaxis_tickangle=-45
)

fig.show()

Some numbers are actually really high: tens of millions of dollars per employee.

To see whether revenues actually depend on the number of employees, I will look at the correlation between them.

In [60]:
revenues = Fortune_1000['Revenue']
employees = Fortune_1000['Employees']

slope, intercept, r_value, p_value, std_err = stats.linregress(employees, revenues)
r_squared = r_value ** 2

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=employees,
    y=revenues,
    mode='markers',
    marker=dict(color='blue', opacity=0.5),
    name='Data Points'
))

x = np.linspace(0, employees.max(), 100)
y = intercept + slope * x
fig.add_trace(go.Scatter(
    x=x,
    y=y,
    mode='lines',
    line=dict(color='red'),
    name='Regression Line'
))

fig.update_layout(
    xaxis_title='Number of Employees',
    yaxis_title='Revenues (millions $)',
    title='Correlation between Revenues and Number of Employees',
    showlegend=True
)

regression_equation = f'Regression Line: y = {slope:.2f}x + {intercept:.2f}'
correlation_text = f'Correlation (r): {r_value:.2f}'
r_squared_text = f'R-squared: {r_squared:.2f}'

fig.add_annotation(
    xref='paper',
    yref='paper',
    x=0.98,
    y=0.9,
    text=regression_equation,
    showarrow=False,
    font=dict(color='white'),
    align='right',
    bgcolor='black',
    bordercolor='black',
    borderwidth=2
)

fig.add_annotation(
    xref='paper',
    yref='paper',
    x=0.98,
    y=0.85,
    text=correlation_text,
    showarrow=False,
    font=dict(color='white'),
    align='right',
    bgcolor='black',
    bordercolor='black',
    borderwidth=2
)

fig.add_annotation(
    xref='paper',
    yref='paper',
    x=0.98,
    y=0.8,
    text=r_squared_text,
    showarrow=False,
    font=dict(color='white'),
    align='right',
    bgcolor='black',
    bordercolor='black',
    borderwidth=2
)

fig.show()

The graph is stretched by outliers in the number of employees (we saw that these are Walmart and Amazon), but a fairly strong positive relationship can still be seen (correlation r of 0.73). The R-squared of 0.54 means the regression line fits the data reasonably well. So it is possible to conclude that, generally speaking, the higher the number of employees, the higher the revenues.
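One way to gauge how much extreme points drive such a fit is to re-run the regression without the largest observation. A sketch with toy numbers containing one extreme outlier (values are made up):

```python
import numpy as np
from scipy import stats

# Five ordinary points plus one extreme outlier (toy values)
employees = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 200.0])
revenues = np.array([5.0, 3.0, 6.0, 4.0, 7.0, 300.0])

r_all = stats.linregress(employees, revenues).rvalue
r_trim = stats.linregress(employees[:-1], revenues[:-1]).rvalue  # outlier dropped
print(f'r with outlier: {r_all:.3f}, r without: {r_trim:.3f}')
```

Here the single outlier inflates r considerably, which is why re-checking the Fortune 1000 fit without Walmart and Amazon would be a sensible follow-up.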

Focus on the geographical dimension¶

To have a look at the geographical dimension and distribution of Fortune 1000 companies, I want to draw a map of the United States with the distribution of headquarters.

In order to draw the map, I downloaded the shapefile of US State Boundaries from this website: https://hub.arcgis.com/datasets/1612d351695b467eba75fdf82c10884f/explore?location=16.090374%2C0.315549%2C2.18&showTable=true.

Before starting, I create a copy of the dataframes in order to avoid accidentally modifying the originals.

In [61]:
Fortune_1000_copy = cp.deepcopy(Fortune_1000)
Fortune_1000_copy.head()
Out[61]:
Company Revenue Assets Profit Employees Sector State City CEO_Founder CEO_Woman New_Entry Profitable CEO Website Profit_Margin(%) ROA(%) Revenue_per_Employee
0 Walmart 572754.0 244860.0 13673.0 2300000 Retailing AR Bentonville no no no yes C. Douglas McMillon https://www.stock.walmart.com 2.387238 5.584007 0.249023
1 Amazon 469822.0 420549.0 33364.0 1608000 Retailing WA Seattle no no no yes Andrew R. Jassy www.amazon.com 7.101413 7.933439 0.292178
2 Apple 365817.0 351002.0 94680.0 154000 Technology CA Cupertino no no no yes Timothy D. Cook www.apple.com 25.881793 26.974205 2.375435
3 CVS Health 292111.0 232999.0 7910.0 258000 Health Care RI Woonsocket no yes no yes Karen Lynch https://www.cvshealth.com 2.707875 3.394864 1.132213
4 UnitedHealth Group 287597.0 212206.0 17285.0 350000 Health Care MN Minnetonka no no no yes Andrew P. Witty www.unitedhealthgroup.com 6.010146 8.145387 0.821706
In [62]:
US_merged_copy = cp.deepcopy(US_merged)
US_merged_copy.head()
Out[62]:
State Abbreviation GDP($) Unemployment(%) Population
0 Alabama AL 254109.7 3.450000 5049846
1 Alaska AK 57349.4 6.425000 734182
2 Arizona AZ 420026.7 4.950000 7264877
3 Arkansas AR 148676.1 4.058333 3028122
4 California CA 3373240.7 7.366667 39142991
In [63]:
current_dir = os.getcwd()
shapefile_path = os.path.join(current_dir, 'US_State_Boundaries.shp')
us_map = gpd.read_file(shapefile_path)

state_counts = Fortune_1000_copy['State'].value_counts()

all_states = pd.Series(0, index=us_map['STATE_ABBR'], name='State_Count')

all_states.update(state_counts)

merged_map = us_map.merge(all_states, left_on='STATE_ABBR', right_index=True, how='left')

color_map = LinearColormap(['white', 'blue'], vmin=0, vmax=merged_map['State_Count'].max(), caption='State Count')

m = folium.Map(location=[37, -95], zoom_start=4)

folium.GeoJson(merged_map,
               name='Choropleth',
               style_function=lambda feature: {
                   'fillColor': color_map(feature['properties']['State_Count']),
                   'color': 'black',
                   'weight': 0.8,
                   'fillOpacity': 0.7
               },
               highlight_function=lambda x: {'weight': 3},
               tooltip=folium.features.GeoJsonTooltip(fields=['NAME', 'State_Count'],
                                                      aliases=['State', 'Count'],
                                                      sticky=True,
                                                      style='background-color: white; color: #333333;'),
               show=False
               ).add_to(m)

color_map.add_to(m)
m
Out[63]:
(interactive Folium choropleth map of the number of Fortune 1000 headquarters per state)

As can be seen, the states with the most company headquarters are California (CA), Texas (TX), and New York (NY). Many companies are also located in Illinois (IL), Ohio (OH), and Pennsylvania (PA).

In [64]:
states = ['CA', 'TX', 'NY', 'IL', 'OH', 'PA']

filtered_df = Fortune_1000[Fortune_1000['State'].isin(states)]

total_companies = len(filtered_df)

print("Total number of companies from CA, TX, NY, IL, OH, PA:", total_companies)
Total number of companies from CA, TX, NY, IL, OH, PA: 476

Together, these 6 states account for 476 of the 1000 companies: almost half of the headquarters are concentrated in 6 states out of 50 (+1, the District of Columbia). This shows a great imbalance in the distribution of Fortune 1000 companies.
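The concentration check can also be written with `value_counts(normalize=True)`, which yields shares directly. A sketch on a toy state column (the real analysis would use `Fortune_1000['State']`):

```python
import pandas as pd

# Toy headquarters-state column (illustrative)
states_col = pd.Series(['CA', 'CA', 'TX', 'NY', 'FL', 'OH', 'WA', 'CA', 'TX', 'PA'])
top6 = ['CA', 'TX', 'NY', 'IL', 'OH', 'PA']

# Share of companies headquartered in the six states
counts = states_col.value_counts(normalize=True)
share = counts.reindex(top6).fillna(0).sum()
print(f'{share:.0%} of the toy companies sit in the six states')
```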

Now I want to understand which states generate the highest revenues and whether there is a correlation between population, GDP, and unemployment rate and the number of Fortune 1000 companies per state.

In [65]:
revenue_by_state = Fortune_1000_copy.groupby('State')['Revenue'].sum().reset_index()
revenue_by_state = revenue_by_state.sort_values('Revenue', ascending=False)

fig = go.Figure(data=go.Bar(x=revenue_by_state['State'], y=revenue_by_state['Revenue']))

fig.update_layout(
    title='Total Revenue by State',
    xaxis_title='State',
    yaxis_title='Total Revenue',
    xaxis_tickangle=-45
)

fig.show()

The next maps will not include Alaska, for a cleaner layout. This makes no difference for the analysis because, as the map above shows, there are no Fortune 1000 companies in Alaska.

In [66]:
revenue_by_state = Fortune_1000_copy.groupby('State')['Revenue'].sum().reset_index()

current_dir = os.getcwd()
shapefile_path = os.path.join(current_dir, 'US_State_Boundaries.shp')
us_map = gpd.read_file(shapefile_path)

merged_map = us_map.merge(revenue_by_state, left_on='STATE_ABBR', right_on='State', how='left')

fig = go.Figure(data=go.Choropleth(
    locations=merged_map['STATE_ABBR'],
    z=merged_map['Revenue'],
    locationmode='USA-states',
    colorscale='Blues',
    colorbar_title='Revenues',
))

fig.update_layout(
    title_text='Revenues per State (Millions $)',
    geo_scope='usa',
)

fig.show()

Of the first 8 states by total revenues, 6 are among those with the most company headquarters (seen before). There is a strong concentration of revenues in those states. Let's see how much.

In [67]:
states = ['CA', 'TX', 'NY', 'IL', 'OH', 'PA']

filtered_df = Fortune_1000[Fortune_1000['State'].isin(states)]

total_revenue_selected = filtered_df['Revenue'].sum()

print("Total revenue of companies from CA, TX, NY, IL, OH, PA: $", total_revenue_selected, " millions.", sep="")
Total revenue of companies from CA, TX, NY, IL, OH, PA: $8910237.5 millions.

That compared to the total revenues of all Fortune 1000...

In [68]:
total_revenue = Fortune_1000['Revenue'].sum()

print("Total revenue of all Fortune 1000 companies: $", total_revenue, " millions.", sep="")
Total revenue of all Fortune 1000 companies: $17986801.4 millions.
In [69]:
8910237.5/17986801.4
Out[69]:
0.4953764319652743

...are about half. This means that the 6 states (CA, TX, NY, IL, OH, PA) that host almost half of the companies' headquarters also generate about half of the total revenues of the Fortune 1000.
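Instead of hard-coding the two totals, the revenue share can be computed in one expression; a sketch with made-up state revenues:

```python
import pandas as pd

# Toy version of Fortune_1000[['State', 'Revenue']] (figures are made up)
toy = pd.DataFrame({'State': ['CA', 'TX', 'NY', 'FL', 'WA'],
                    'Revenue': [400.0, 300.0, 200.0, 150.0, 50.0]})
top6 = ['CA', 'TX', 'NY', 'IL', 'OH', 'PA']

# Revenue of companies in the six states as a share of the overall total
share = toy.loc[toy['State'].isin(top6), 'Revenue'].sum() / toy['Revenue'].sum()
print(f'Revenue share of the six states: {share:.1%}')
```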

Now let's see which states have the largest populations, GDP, and GDP per capita in order to search for possible correlations.

In [70]:
current_dir = os.getcwd()
shapefile_path = os.path.join(current_dir, 'US_State_Boundaries.shp')
us_map = gpd.read_file(shapefile_path)

merged_map = us_map.merge(US_merged_copy[['Abbreviation', 'Population']], left_on='STATE_ABBR', right_on='Abbreviation', how='left')

fig = go.Figure(data=go.Choropleth(
    locations=merged_map['STATE_ABBR'],
    z=merged_map['Population'],
    locationmode='USA-states',
    colorscale='Blues',
    colorbar_title='Population',
))

fig.update_layout(
    title_text='Population per State',
    geo_scope='usa',
)

fig.show()

Those states also seem to be the most populated (together with Florida).

In [71]:
current_dir = os.getcwd()
shapefile_path = os.path.join(current_dir, 'US_State_Boundaries.shp')
us_map = gpd.read_file(shapefile_path)

merged_map = us_map.merge(US_merged_copy[['Abbreviation', 'GDP($)']], left_on='STATE_ABBR', right_on='Abbreviation', how='left')

fig = go.Figure(data=go.Choropleth(
    locations=merged_map['STATE_ABBR'],
    z=merged_map['GDP($)'],
    locationmode='USA-states',
    colorscale='Blues',
    colorbar_title='GDP',
))

fig.update_layout(
    title_text='GDP per State (Millions $)',
    geo_scope='usa',
)

fig.show()

And also those with the highest GDP.

In order to visualise GDP per capita in a more readable way, I exclude the District of Columbia, which is a strong outlier with a GDP per capita of almost $230k.

In [72]:
US_merged_copy['GDP_per_capita'] = US_merged_copy['GDP($)'] / US_merged_copy['Population']

current_dir = os.getcwd()
shapefile_path = os.path.join(current_dir, 'US_State_Boundaries.shp')
us_map = gpd.read_file(shapefile_path)

merged_map = us_map.merge(US_merged_copy[['Abbreviation', 'GDP_per_capita']], left_on='STATE_ABBR', right_on='Abbreviation', how='left')

merged_map = merged_map[merged_map['NAME'] != 'District of Columbia']

fig = go.Figure(data=go.Choropleth(
    locations=merged_map['STATE_ABBR'],
    z=merged_map['GDP_per_capita'],
    locationmode='USA-states',
    colorscale='Blues',
    colorbar_title='GDP per capita',
))

fig.update_layout(
    title_text='GDP per Capita in US (Millions $)',
    geo_scope='usa',
)

fig.show()
In [73]:
US_merged_copy[US_merged_copy.State=="District of Columbia"]
Out[73]:
State Abbreviation GDP($) Unemployment(%) Population GDP_per_capita
8 District of Columbia DC 153670.5 6.675 668791 0.229774

Observing GDP per capita, many states have values higher than or similar to the 6 states seen before. This suggests that the number of companies and their revenues correlate mainly with population and total GDP, not with GDP per capita.
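This hypothesis can be checked numerically by correlating the number of companies with each state-level variable. A sketch on toy state data (column names are illustrative; the real analysis would use `US_merged_copy` joined with the per-state company counts):

```python
import pandas as pd

# Toy state-level data (illustrative numbers)
toy = pd.DataFrame({'Population': [39.0, 29.0, 19.0, 12.0, 11.0],
                    'GDP': [3.4, 2.0, 1.8, 0.9, 0.7],
                    'Companies': [120, 100, 90, 35, 40]})
toy['GDP_per_capita'] = toy['GDP'] / toy['Population']

# Correlation of each variable with the number of companies
corrs = toy.corr()['Companies'].drop('Companies')
print(corrs.round(2))
```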

In [74]:
current_dir = os.getcwd()
shapefile_path = os.path.join(current_dir, 'US_State_Boundaries.shp')
us_map = gpd.read_file(shapefile_path)

merged_map = us_map.merge(US_merged_copy[['Abbreviation', 'Unemployment(%)']], left_on='STATE_ABBR', right_on='Abbreviation', how='left')

fig = go.Figure(data=go.Choropleth(
    locations=merged_map['STATE_ABBR'],
    z=merged_map['Unemployment(%)'],
    locationmode='USA-states',
    colorscale='Blues',
    colorbar_title='Unemployment(%)',
))

fig.update_layout(
    title_text='Unemployment rate by State (%)',
    geo_scope='usa',
)

fig.show()

Analysing the unemployment rate by state, it seems that the states with the highest number of companies are actually also among those with the highest unemployment. This is interesting because higher unemployment usually signals a weaker economy, i.e. unfavourable conditions for companies to establish themselves. On the other hand, high unemployment may indicate a surplus of available labor, which can lead to lower wages and reduced labor costs; in such cases, businesses may have more flexibility to hire additional workers or invest in productivity-enhancing technologies. There are so many macroeconomic factors at play that it is impossible to find perfect correlations and draw firm conclusions.

I want to plot some scatter plots and study linear relationships to find out more about these relationships and answer the research question.

In [75]:
num_companies_by_state = Fortune_1000_copy['State'].value_counts().reset_index()
num_companies_by_state.columns = ['State', 'Number_of_Companies']

merged_df = US_merged_copy.merge(num_companies_by_state, left_on='Abbreviation', right_on='State', how='left')

merged_df['Number_of_Companies'] = merged_df['Number_of_Companies'].fillna(0)

fig = px.scatter(merged_df, x='Population', y='GDP($)', size='Number_of_Companies',
                 color='State_x', hover_data=['State_x'], title='Relationship between Population, State GDP, and Number of Fortune 1000 Companies')

fig.update_layout(
    xaxis_title='Population',
    yaxis_title='State GDP',
    showlegend=True,
    hovermode='closest'
)

fig.show()
In [76]:
merged_df['Population'] = merged_df['Population'] / 1e6

fig = px.scatter(merged_df, x='Population', y='Number_of_Companies', size='GDP($)',
                 color='State_x', hover_data=['State_x'], title='Relationship between Population and Number of Fortune 1000 Companies')

slope, intercept, r_value, p_value, std_err = stats.linregress(merged_df['Population'], merged_df['Number_of_Companies'])
line = slope * merged_df['Population'] + intercept

fig.add_trace(px.line(x=merged_df['Population'], y=line).data[0])

r_squared = r_value**2

equation_text = f"y = {slope:.2f}x + {intercept:.2f}"
correlation_text = f"Correlation (r): {r_value:.2f}"
r_squared_text = f"R-squared: {r_squared:.2f}"

fig.add_annotation(x=0.95, y=0.9, xref="paper", yref="paper",
                   text=equation_text, showarrow=False, font=dict(size=12))
fig.add_annotation(x=0.95, y=0.8, xref="paper", yref="paper",
                   text=correlation_text, showarrow=False, font=dict(size=12))
fig.add_annotation(x=0.95, y=0.7, xref="paper", yref="paper",
                   text=r_squared_text, showarrow=False, font=dict(size=12))

fig.update_layout(
    xaxis_title='Population (Millions)',
    yaxis_title='Number of Fortune 1000 Companies',
    showlegend=True,
    hovermode='closest'
)

fig.show()

As suspected before, there seems to be a strong positive relationship between population and the number of Fortune 1000 companies. The higher the number of people in a state, the higher the number of companies.

In [77]:
merged_df['GDP($)'] = merged_df['GDP($)'] / 1e6
fig = px.scatter(merged_df, x='GDP($)', y='Number_of_Companies', size='Population',
                 color='State_x', hover_data=['State_x'], title='Relationship between GDP and Number of Fortune 1000 Companies')

slope, intercept, r_value, p_value, std_err = stats.linregress(merged_df['GDP($)'], merged_df['Number_of_Companies'])
line = slope * merged_df['GDP($)'] + intercept

fig.add_trace(px.line(x=merged_df['GDP($)'], y=line).data[0])

r_squared = r_value**2

equation_text = f"y = {slope:.2f}x + {intercept:.2f}"
correlation_text = f"Correlation (r): {r_value:.2f}"
r_squared_text = f"R-squared: {r_squared:.2f}"

fig.add_annotation(x=0.95, y=0.9, xref="paper", yref="paper",
                   text=equation_text, showarrow=False, font=dict(size=12))
fig.add_annotation(x=0.95, y=0.8, xref="paper", yref="paper",
                   text=correlation_text, showarrow=False, font=dict(size=12))
fig.add_annotation(x=0.95, y=0.7, xref="paper", yref="paper",
                   text=r_squared_text, showarrow=False, font=dict(size=12))

fig.update_layout(
    xaxis_title='GDP (Millions $)',
    yaxis_title='Number of Fortune 1000 Companies',
    showlegend=True,
    hovermode='closest'
)

fig.show()

As suspected, the relationship between GDP and the number of Fortune 1000 companies also appears to be strongly positive.

In [78]:
fig = px.scatter(merged_df, x='GDP_per_capita', y='Number_of_Companies', size='Population',
                 color='State_x', hover_data=['State_x'], title='Relationship between GDP per capita and Number of Fortune 1000 Companies')

slope, intercept, r_value, p_value, std_err = stats.linregress(merged_df['GDP_per_capita'], merged_df['Number_of_Companies'])
line = slope * merged_df['GDP_per_capita'] + intercept

fig.add_trace(px.line(x=merged_df['GDP_per_capita'], y=line).data[0])

r_squared = r_value**2

equation_text = f"y = {slope:.2f}x + {intercept:.2f}"
correlation_text = f"Correlation (r): {r_value:.2f}"
r_squared_text = f"R-squared: {r_squared:.2f}"

fig.add_annotation(x=0.95, y=0.9, xref="paper", yref="paper",
                   text=equation_text, showarrow=False, font=dict(size=12))
fig.add_annotation(x=0.95, y=0.8, xref="paper", yref="paper",
                   text=correlation_text, showarrow=False, font=dict(size=12))
fig.add_annotation(x=0.95, y=0.7, xref="paper", yref="paper",
                   text=r_squared_text, showarrow=False, font=dict(size=12))

fig.update_layout(
    xaxis_title='GDP per capita (Millions $)',
    yaxis_title='Number of Fortune 1000 Companies',
    showlegend=True,
    hovermode='closest'
)

fig.show()

Instead, there is no clear relationship between GDP per capita and the number of Fortune 1000 companies: both the correlation r and the R-squared are very low, possibly in part because of the District of Columbia (a strong outlier, the small point on the right of the graph).

In [79]:
fig = px.scatter(merged_df, x='Unemployment(%)', y='Number_of_Companies', size='Population',
                 color='State_x', hover_data=['State_x'], title='Relationship between Unemployment Rates and Number of Fortune 1000 Companies')

slope, intercept, r_value, p_value, std_err = stats.linregress(merged_df['Unemployment(%)'], merged_df['Number_of_Companies'])
line = slope * merged_df['Unemployment(%)'] + intercept

fig.add_trace(px.line(x=merged_df['Unemployment(%)'], y=line).data[0])

r_squared = r_value**2

equation_text = f"y = {slope:.2f}x + {intercept:.2f}"
correlation_text = f"Correlation (r): {r_value:.2f}"
r_squared_text = f"R-squared: {r_squared:.2f}"

fig.add_annotation(x=0.95, y=0.9, xref="paper", yref="paper",
                   text=equation_text, showarrow=False, font=dict(size=12))
fig.add_annotation(x=0.95, y=0.8, xref="paper", yref="paper",
                   text=correlation_text, showarrow=False, font=dict(size=12))
fig.add_annotation(x=0.95, y=0.7, xref="paper", yref="paper",
                   text=r_squared_text, showarrow=False, font=dict(size=12))

fig.update_layout(
    xaxis_title='Unemployment Rate (%)',
    yaxis_title='Number of Fortune 1000 Companies',
    showlegend=True,
    hovermode='closest'
)

fig.show()

And there is also no relationship between unemployment rates and the number of Fortune 1000 companies.
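The four regressions above can be condensed into one loop producing a small summary table; a sketch with toy state data (the real analysis would pass `merged_df` instead):

```python
import pandas as pd
from scipy import stats

# Toy stand-in for merged_df (illustrative values)
toy = pd.DataFrame({'Population': [39.0, 29.0, 19.0, 12.0, 11.0, 6.0],
                    'GDP($)': [3.4, 2.0, 1.8, 0.9, 0.7, 0.4],
                    'GDP_per_capita': [0.087, 0.069, 0.095, 0.075, 0.064, 0.067],
                    'Unemployment(%)': [7.4, 5.6, 4.9, 6.1, 5.0, 3.2],
                    'Number_of_Companies': [120, 100, 90, 35, 40, 12]})

rows = []
for col in ['Population', 'GDP($)', 'GDP_per_capita', 'Unemployment(%)']:
    res = stats.linregress(toy[col], toy['Number_of_Companies'])
    rows.append({'Predictor': col,
                 'r': round(res.rvalue, 2),
                 'R-squared': round(res.rvalue ** 2, 2)})
print(pd.DataFrame(rows))
```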

Company policies¶

Now I would like to study the impact of CEO characteristics on financial performance. In particular, I want to see how many Fortune 1000 companies have a founder CEO or a female CEO (or both), and whether this represents an advantage compared to other companies. Founder CEOs and female CEOs are, in fact, relatively rare in large companies.

In [80]:
Fortune_1000_copy2 = cp.deepcopy(Fortune_1000_copy)

founder_ceo_count = Fortune_1000_copy2['CEO_Founder'].value_counts()
women_ceo_count = Fortune_1000_copy2['CEO_Woman'].value_counts()
founder_women_ceo_count = Fortune_1000_copy2[(Fortune_1000_copy2['CEO_Founder'] == 'yes') & (Fortune_1000_copy2['CEO_Woman'] == 'yes')].shape[0]

print("Founder CEOs Count:")
print(founder_ceo_count)
print("\nWomen CEOs Count:")
print(women_ceo_count)
print("\nFounder and Women CEOs Count:")
print(founder_women_ceo_count)
Founder CEOs Count:
no     944
yes     56
Name: CEO_Founder, dtype: int64

Women CEOs Count:
no     917
yes     83
Name: CEO_Woman, dtype: int64

Founder and Women CEOs Count:
0
In [81]:
Fortune_1000_copy2 = cp.deepcopy(Fortune_1000_copy)
Fortune_1000_copy2['CEO_Type'] = 'Neither'
Fortune_1000_copy2.loc[Fortune_1000_copy2['CEO_Founder'] == 'yes', 'CEO_Type'] = 'Founder'
Fortune_1000_copy2.loc[Fortune_1000_copy2['CEO_Woman'] == 'yes', 'CEO_Type'] = 'Women'
Fortune_1000_copy2.loc[(Fortune_1000_copy2['CEO_Founder'] == 'yes') & (Fortune_1000_copy2['CEO_Woman'] == 'yes'), 'CEO_Type'] = 'Both'

colors = {'Neither': 'gray', 'Founder': 'blue', 'Women': 'purple', 'Both': 'green'}

fig = px.scatter(Fortune_1000_copy2, x='Revenue', y='Profit', color='CEO_Type', color_discrete_map=colors)
fig.update_traces(marker=dict(size=9))
fig.update_layout(title='Profits and Revenues by CEO features', xaxis_title='Revenues', yaxis_title='Profits')
fig.show()

As we can see, there are few companies with founder or women CEOs, and no company whose CEO is both. Neither characteristic seems to have an impact on revenues or profits. Let's see which sectors have more founder or women CEOs.
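A quick numeric complement to the scatter plot: the mean profit margin per CEO type. A sketch on toy rows (the real call would group `Fortune_1000_copy2` on the same columns):

```python
import pandas as pd

# Toy rows standing in for Fortune_1000_copy2 (values are made up)
toy = pd.DataFrame({
    'CEO_Type': ['Neither', 'Neither', 'Founder', 'Women', 'Founder', 'Neither'],
    'Profit_Margin(%)': [5.0, 8.0, 12.0, 6.0, -2.0, 7.0],
})

# Mean margin and group size per CEO type
summary = toy.groupby('CEO_Type')['Profit_Margin(%)'].agg(['mean', 'count'])
print(summary.round(2))
```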

In [82]:
ceo_type_counts = Fortune_1000_copy2.groupby(['Sector', 'CEO_Type']).size().unstack(fill_value=0)
colors = {'Neither': 'gray', 'Women': 'purple', 'Founder': 'blue'}

fig = go.Figure()

for ceo_type, color in colors.items():
    fig.add_trace(go.Bar(x=ceo_type_counts.index, y=ceo_type_counts[ceo_type],
                         name=ceo_type, marker_color=color))

fig.update_layout(title='Distribution of Fortune 1000 Companies by Sector',
                  xaxis_title='Sector',
                  yaxis_title='Frequency',
                  barmode='stack')

fig.show()

Retailing (14), Energy (9), and Health Care (9) seem to be the sectors with the most women CEOs, while Technology is the sector with the most founder CEOs (25). Financials, with 12 women and 9 founder CEOs, apparently has good numbers too, but relative to the total number of companies in the sector these figures are about average.

We can say that women and founder CEOs are not very common among Fortune 1000 companies, but also that there seem to be no particular differences in financial performance.

Conclusion¶

In conclusion, the analysis of Fortune 1000 companies has provided an insight into their financial performance, sector dynamics, geographical distribution, and leadership patterns. By examining key financial indicators and identifying sectors that have performed well, we gained a better understanding of the factors contributing to their success. Moreover, the analysis shed light on the presence of founder CEOs and women CEOs, emphasizing diversity and leadership patterns within these companies.

However, it is important to acknowledge that this analysis represents a snapshot of the Fortune 1000 companies based on the available data. Further research could delve deeper into specific aspects and uncover additional insights.

Overall, this analysis contributes to enhancing our understanding of Fortune 1000 companies and the factors that impact their performance and growth.